
CS699
Lecture 9: Correlation Analysis and Other Frequent Pattern Mining

Association Rule Mining on Weka
 Data preparation
 When performing association rule mining on transactional data using Weka, the dataset must be converted to an appropriate form:
 Each item becomes an attribute.
 Each attribute takes on only a single value, e.g., {1} or {t}.
 Only items are used (i.e., transaction ids, customer ids, etc. are removed, temporarily or permanently).

Association Rule Mining on Weka
 Data preparation example

CID  Items
C1   beer, bread, chip, egg
C2   beer, bread, chip, egg, popcorn
C3   bread, chip, egg
C4   beer, bread, chip, egg, milk, popcorn
C5   beer, bread, milk
C6   beer, bread, egg
C7   bread, chip, milk
C8   bread, butter, chip, egg, milk
C9   butter, chip, egg
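
As an illustration, the conversion can be scripted; the sketch below is my own (it assumes pandas and is not part of the course material), using 't' for an item that is present and Weka's missing-value marker '?' otherwise:

```python
# Hypothetical helper: one-hot encode the transactional data for Weka-style
# association rule mining. Customer ids are dropped; items become attributes.
import pandas as pd

transactions = {
    "C1": ["beer", "bread", "chip", "egg"],
    "C2": ["beer", "bread", "chip", "egg", "popcorn"],
    "C3": ["bread", "chip", "egg"],
    "C4": ["beer", "bread", "chip", "egg", "milk", "popcorn"],
    "C5": ["beer", "bread", "milk"],
    "C6": ["beer", "bread", "egg"],
    "C7": ["bread", "chip", "milk"],
    "C8": ["bread", "butter", "chip", "egg", "milk"],
    "C9": ["butter", "chip", "egg"],
}

items = sorted({item for basket in transactions.values() for item in basket})
df = pd.DataFrame(
    [{item: ("t" if item in basket else "?") for item in items}
     for basket in transactions.values()]
)
print(df)  # one attribute per item, each taking the single value 't' (or missing)
```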

Association Rule Mining on Weka
 Running Apriori on Weka
 Starts with min. support of 100% and decreases this in steps of 5% until there are at least 10 rules with the min. confidence of 90% or until the support has reached a lower bound of 10%.
 These default values can be changed.
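
This driver loop can be emulated outside Weka. A minimal sketch, assuming the mlxtend library (my stand-in, not what Weka uses internally) and a boolean one-hot table like the one built above:

```python
from mlxtend.frequent_patterns import apriori, association_rules

def weka_style_apriori(onehot_df, num_rules=10, min_conf=0.9,
                       start=1.0, delta=0.05, lower_bound=0.1):
    """Mimic Weka's default Apriori driver: start at 100% support and lower
    it in 5% steps until num_rules rules reach min_conf, or until the lower
    bound on support is hit."""
    support, rules = start, None
    while support >= lower_bound:
        freq = apriori(onehot_df, min_support=support, use_colnames=True)
        if not freq.empty:
            rules = association_rules(freq, metric="confidence",
                                      min_threshold=min_conf)
            if len(rules) >= num_rules:
                break
        support -= delta  # decrease min. support in steps of 5%
    return rules

# e.g., with the table from the previous sketch: weka_style_apriori(df == "t")
```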

Interestingness Measure: Correlations (Lift)

 play basketball ⇒ eat cereal [40%, 66.7%] is misleading
 The overall % of students eating cereal is 75% > 66.7%.
 play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
 Measure of dependent/correlated events: lift

lift = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

lift(B,C)  = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89
lift(B,¬C) = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33

 lift > 1: positively correlated; lift < 1: negatively correlated; lift = 1: independent

Chi-square Test

             Basketball    Not basketball   Sum (row)
Cereal       2000 (2250)   1750 (1500)      3750
Not cereal   1000 (750)    250 (500)        1250
Sum (col.)   3000          2000             5000

 First, we compute the expected values (shown in the parentheses).
Example: for (cereal, basketball), expected value = (3750 * 3000) / 5000 = 2250
 Second, compute the chi-square test statistic:

χ² = (2000-2250)²/2250 + (1750-1500)²/1500 + (1000-750)²/750 + (250-500)²/500 = 277.78

Chi-square Test

 Third, look up the chi-square distribution table, with
degrees of freedom = (num_rows - 1) * (num_cols - 1) = 1, and α = 0.05:
χ²(0.05, 1) = 3.84

Chi-square Test

 Finally, compare the computed test statistic with the value from the distribution table and make a conclusion.
 If the computed χ² statistic is greater than the χ² from the distribution table, we reject the null hypothesis. Otherwise, we cannot reject the null hypothesis.
 In this example, the computed chi-square value (277.78) is greater than that from the chi-square distribution table (3.84), i.e., it is in the rejection region.
 So, we reject the null hypothesis and conclude that there is a correlation between the two.
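
Both numbers can be checked with a few lines of plain Python; this sketch is for verification only:

```python
# Basketball/cereal contingency table from the slides above.
n = 5000
row = {"cereal": 3750, "not_cereal": 1250}          # row sums
col = {"basketball": 3000, "not_basketball": 2000}  # column sums
observed = {("cereal", "basketball"): 2000, ("cereal", "not_basketball"): 1750,
            ("not_cereal", "basketball"): 1000, ("not_cereal", "not_basketball"): 250}

# lift(B, C) = P(B and C) / (P(B) * P(C))
lift = (observed[("cereal", "basketball")] / n) / ((col["basketball"] / n) * (row["cereal"] / n))
print(round(lift, 2))   # 0.89 -> negatively correlated

# chi-square statistic: sum of (observed - expected)^2 / expected over all cells,
# where expected = row_sum * col_sum / n (e.g., 3750 * 3000 / 5000 = 2250)
chi2 = sum((o - row[r] * col[c] / n) ** 2 / (row[r] * col[c] / n)
           for (r, c), o in observed.items())
print(round(chi2, 2))   # 277.78 > 3.84 -> reject independence
```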
Null Transactions

            coffee   Not coffee   Sum (row)
milk        100      1,000        1,100
Not milk    1,100    100,000      101,100
Sum (col.)  1,200    101,000      102,200

lift(m,c) = (100/102200) / ((1100/102200) * (1200/102200)) = 7.74

 When the number of null transactions is large, these measures may generate misleading results.
 The lift measure indicates milk and coffee are positively correlated.
 But the actual data says they are negatively correlated:
 Among the 1,100 people who bought milk, only 100 (about 9%) also bought coffee, and similarly for those who bought coffee (100 out of 1,200, or about 8%).

all_confidence and cosine

            coffee   Not coffee   Sum (row)
milk        100      1,000        1,100
Not milk    1,100    100,000      101,100
Sum (col.)  1,200    101,000      102,200

all_conf(m,c) = sup(m ∪ c) / max{sup(m), sup(c)} = 100/1200 = 0.08
cosine(m,c) = sup(m ∪ c) / sqrt(sup(m) * sup(c)) = 100 / sqrt(1200 * 1100) = 0.09

 Both measures range between 0 and 1
 greater than 0.5: positively correlated; smaller than 0.5: negatively correlated
 These measures show milk and coffee are negatively correlated.
 all_confidence and cosine measures are null-invariant.

all_confidence, cosine: another example

            coffee   Not coffee   Sum (row)
milk        1,000    1,000        2,000
Not milk    1,000    100,000      101,000
Sum (col.)  2,000    101,000      103,000

 lift(m,c) = (1000/103000) / ((2000/103000) * (2000/103000)) = 25.75 (says positively correlated)
 all_conf(m,c) = 1000/2000 = 0.5 (says independent)
 cosine(m,c) = 1000 / sqrt(2000 * 2000) = 0.5 (says independent)
 Actual data: independent
 Other measures: max_confidence, Kulczynski measure

Kulczynski and Imbalance Ratio (IR)

            coffee   Not coffee   Sum (row)
milk        1,000    1,000        2,000
Not milk    1,000    100,000      101,000
Sum (col.)  2,000    101,000      103,000

 Kulc(A,B) = (P(A|B) + P(B|A)) / 2, i.e., the average of the two conditional probabilities
 Between 0 and 1; > 0.5: positive; < 0.5: negative; = 0.5: independent

Kulc(m,c) = (P(m|c) + P(c|m)) / 2 = (1000/2000 + 1000/2000) / 2 = 0.5 (independent)

 IR(A,B) = |sup(A) - sup(B)| / (sup(A) + sup(B) - sup(A ∪ B)), 0 ≤ IR < 1

IR(m,c) = |2000 - 2000| / (2000 + 2000 - 1000) = 0 (balanced)
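
All of these measures are simple functions of the four cell counts; the sketch below (my own code, parameter names follow the m/c notation above) reproduces the numbers from the last two tables:

```python
from math import sqrt

def measures(mc, mpc, mcp, mpcp):
    """mc = milk & coffee, mpc = not-milk & coffee,
    mcp = milk & not-coffee, mpcp = not-milk & not-coffee."""
    n = mc + mpc + mcp + mpcp
    sup_m, sup_c = mc + mcp, mc + mpc           # sup(m), sup(c)
    return {
        "lift":     (mc / n) / ((sup_m / n) * (sup_c / n)),
        "all_conf": mc / max(sup_m, sup_c),
        "cosine":   mc / sqrt(sup_m * sup_c),
        "Kulc":     (mc / sup_m + mc / sup_c) / 2,
        "IR":       abs(sup_m - sup_c) / (sup_m + sup_c - mc),
    }

# Null-transaction table: lift says positive (7.74), while the
# null-invariant measures say negative (all_conf 0.08, cosine 0.09).
print(measures(mc=100, mpc=1100, mcp=1000, mpcp=100000))

# Second table: lift says positive (25.75); all_conf, cosine,
# and Kulc all say independent (0.5), and IR = 0 (balanced).
print(measures(mc=1000, mpc=1000, mcp=1000, mpcp=100000))
```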
Kulczynski and Imbalance Ratio (IR)

 In the table in the next slide, mc, m'c, mc', and m'c' represent the following entries in the contingency table:

            coffee   Not coffee
milk        mc       mc'
Not milk    m'c      m'c'

Kulczynski and Imbalance Ratio (IR)

 Example: D1 (P)

            coffee         Not coffee       Sum (row)
milk        10,000 (mc)    1,000 (mc')      11,000
Not milk    1,000 (m'c)    100,000 (m'c')   101,000
Sum (col.)  11,000         101,000          112,000

Comparison

        mc       m'c     mc'      m'c'      lift       all_conf   cosine     Kulc       IR
D1 (P)  10,000   1,000   1,000    100,000   9.26 (P)   0.91 (P)   0.91 (P)   0.91 (P)   0
D2 (P)  10,000   1,000   1,000    100       1.00 (I)   0.91 (P)   0.91 (P)   0.91 (P)   0
D3 (N)  100      1,000   1,000    100,000   8.44 (P)   0.09 (N)   0.09 (N)   0.09 (N)   0
D4 (I)  1,000    1,000   1,000    100,000   25.75 (P)  0.50 (I)   0.50 (I)   0.50 (I)   0
D5 (*)  1,000    100     10,000   100,000   9.18 (P)   0.09 (N)   0.29 (N)   0.50 (I)   0.89
D6 (*)  1,000    10      100,000  100,000   1.97 (P)   0.01 (N)   0.01 (N)   0.50 (I)   0.99

 P: positive, N: negative, I: independent, *: contradictory
 Both D1 and D2 have positively correlated data, but lift shows very different values.
 D3 has negatively correlated data, but lift says positive.
 D4 has independent data, but lift says positive. This is because lift is affected by null transactions.
 all_conf, cosine, and Kulczynski are null-invariant.

Comparison (continued)

        mc       m'c     mc'      m'c'      lift       all_conf   cosine     Kulc       IR
D5 (*)  1,000    100     10,000   100,000   9.18 (P)   0.09 (N)   0.29 (N)   0.50 (I)   0.89
D6 (*)  1,000    10      100,000  100,000   1.97 (P)   0.01 (N)   0.01 (N)   0.50 (I)   0.99

 P: positive, N: negative, I: independent, *: contradictory
 D5: P(c|m) = 9.09% (negatively correlated), but P(m|c) = 90.9% (positively correlated)
 D6: P(c|m) = 0.99% (negatively correlated), but P(m|c) = 99% (positively correlated)
 Kulczynski says independent (which makes sense)
 IR indicates D5 and D6 are unbalanced.
 Kulczynski along with IR is recommended.

Kulczynski and Imbalance Ratio (IR)

D4 (I): Kulc = 0.5, IR = 0

            milk          Not milk         Sum (row)
coffee      1,000 (mc)    1,000 (m'c)      2,000
Not coffee  1,000 (mc')   100,000 (m'c')   101,000
Sum (col.)  2,000         101,000          103,000

D5 (*): Kulc = 0.5, IR = 0.89

            milk          Not milk         Sum (row)
coffee      1,000 (mc)    100 (m'c)        1,100
Not coffee  10,000 (mc')  100,000 (m'c')   110,000
Sum (col.)  11,000        100,100          111,100

Multi-level Association Rule Mining

 When there is a concept hierarchy in the database
 Not many strong association rules at low levels
 Different users are interested in association rules at different levels
 Example database

TID  Items
1    IBM desktop computer, Sony b/w printer
2    MS educational SW, MS financial management SW
3    Logitech mouse, Ergoway wrist pad
4    IBM desktop computer, MS financial management SW
5    IBM desktop computer
...

Multi-level Association Rule Mining

 Concept hierarchy

all
  computer: desktop, laptop
  software: educational, financial
  printer: color, b/w
  computer accessory: mouse, wrist pad
(brand level below, e.g., IBM, MS, HP, Sony, Logitech, Ergoway, ...)

Multi-level Association Rule Mining

 Mining rules
 In general, a top-down approach is employed.
 Once all frequent itemsets at level 1 are identified, those at level 2 are found, and so on.
 For each level, any algorithm for finding frequent itemsets can be used.
 Variations
 Using uniform support
 Using reduced minimum support

Multi-level Association Rule Mining

 Using uniform support for all levels
 The same minimum support is used for all levels.

Level 1 (min_sup = 5%): computer (support = 10%)
Level 2 (min_sup = 5%): desktop (support = 6%), laptop (support = 4%)

 Frequent itemsets: computer, desktop
 laptop is discarded

Multi-level Association Rule Mining

 Using reduced minimum support at lower levels
 Each level has its own min. support.
 Lower levels have smaller min. supports.

Level 1 (min_sup = 5%): computer (support = 10%)
Level 2 (min_sup = 3%): desktop (support = 6%), laptop (support = 4%)

 All of computer, desktop, and laptop are found as frequent itemsets.
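
The level-wise search with per-level thresholds can be sketched in a few lines; the hierarchy and support values below are the illustrative numbers from these slides, not the output of a real mining run:

```python
# Toy sketch of level-wise mining with reduced minimum support per level.
hierarchy = {"computer": ["desktop", "laptop"]}
support = {"computer": 0.10, "desktop": 0.06, "laptop": 0.04}
min_sup = {1: 0.05, 2: 0.03}  # per-level thresholds (reduced at level 2)

level1 = [item for item in hierarchy if support[item] >= min_sup[1]]
# Children of infrequent level-1 nodes are never examined.
level2 = [child for parent in level1 for child in hierarchy[parent]
          if support[child] >= min_sup[2]]

print(level1 + level2)  # ['computer', 'desktop', 'laptop']
# With a uniform min_sup of 5% at both levels, 'laptop' (4%) would be discarded.
```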
Multi-level Association Rule Mining

 If a node does not satisfy minimum support, its children don't need to be examined.
 If min. support is set too high, some meaningful rules at low levels may be missed.
 If min. support is set too low, too many uninteresting rules at high levels may be found.

Mining Sequential Patterns

 Given a sequence of elements or events, find a sequential pattern that occurs frequently.
 Applications:
 Customer shopping sequences
 Web click streams
 Program execution sequences
 Biological sequences
 Sequences of events in natural or social development

Mining Sequential Patterns

 We will discuss sequential pattern mining from a transactional database.
 The approach described here is based on a sequential pattern mining algorithm called GSP (Generalized Sequential Patterns).
 A transaction is a tuple (sid, ts, itemset), where
   sid is a sequence id, which typically is a customer id (or cid),
   ts is a timestamp, and
   itemset is a set of items (and items are ordered).
 A data-sequence is an ordered list of transactions.
 A (transactional) database is a set of data-sequences.

Mining Sequential Patterns

 Example database

CID  Time  Items
1    3/25  30
1    3/30  90
2    3/10  10
2    3/15  20,30
2    3/20  40,60,70
3    3/25  30,50,70
4    3/25  30
4    3/30  40,70
4    4/25  90
5    3/12  90

• There are five data-sequences, each corresponding to a customer.
• The first data-sequence has two transactions, the second data-sequence has three transactions, ...
• Each transaction represents a purchase by a customer of a set of items at a certain time.

Mining Sequential Patterns

 Example database (same as the previous slide)

• A sequence is an ordered list of itemsets.
• Sequence examples: <{30}, {90}>, <{30}, {40, 70}>, <{10}, {40, 60}>
• The above sequences are a 2-sequence, a 3-sequence, and a 3-sequence, respectively.
• A k-sequence is a sequence consisting of k items.

Mining Sequential Patterns

 A sequence A = <a1, a2, ..., an> is a subsequence of another sequence B = <b1, b2, ..., bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
 Example:
A = <{3}, {4,5}, {8}>
B = <{7}, {3,8}, {9}, {4,5,6}, {8}>
A is a subsequence of B.

Mining Sequential Patterns

C = <{3}, {15}>
D = <{3,5}, {7,8}, {2,15}>
C is a subsequence of D.

E = <{3}, {5,8}>
F = <{2,3}, {5}, {7,8}>
E is not a subsequence of F.

G = <{3}, {8}>
H = <{3,8}>
G is not a subsequence of H.
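
The check can be written as a single greedy pass; the sketch below (my own code, not from the GSP paper) verifies the examples on this slide and the previous one:

```python
def is_subsequence(a, b):
    """Greedy check that sequence a (a list of itemsets) is a subsequence
    of b: each element of a must be a subset of a distinct, later element
    of b, preserving order."""
    i = 0
    for element in b:
        if i < len(a) and set(a[i]) <= set(element):
            i += 1
    return i == len(a)

print(is_subsequence([{3}, {4, 5}, {8}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
print(is_subsequence([{3}, {15}], [{3, 5}, {7, 8}, {2, 15}]))  # True:  C is a subsequence of D
print(is_subsequence([{3}, {5, 8}], [{2, 3}, {5}, {7, 8}]))    # False: E is not a subsequence of F
print(is_subsequence([{3}, {8}], [{3, 8}]))                    # False: {3} and {8} need two separate elements
```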

Mining Sequential Patterns

 Support of a sequence is the number of data-sequences that "contain" this sequence.
 A data-sequence d contains a sequence s if s is a subsequence of d.

cid  ts  Itemset
C1   1   cheese
C1   2   butter
C1   15  bread, milk
C2   1   butter, cheese
C2   20  egg, bread
C2   50  milk
C2   50  egg

support of <{cheese}> = 2 (100%)
support of <{egg}> = 1 (50%)
support of <{cheese}, {egg}> = 1 (50%)
support of <{cheese}, {milk}> = 2 (100%)
support of <{cheese}, {bread, milk}> = 1 (50%)
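
Support counting then just applies the same subsequence check to every data-sequence; a short sketch using the two data-sequences above:

```python
def is_subsequence(a, b):  # same greedy helper as in the previous sketch
    i = 0
    for element in b:
        if i < len(a) and set(a[i]) <= set(element):
            i += 1
    return i == len(a)

data_sequences = {  # transactions ordered by timestamp
    "C1": [{"cheese"}, {"butter"}, {"bread", "milk"}],
    "C2": [{"butter", "cheese"}, {"egg", "bread"}, {"milk"}, {"egg"}],
}

def support(seq):
    return sum(is_subsequence(seq, d) for d in data_sequences.values())

print(support([{"cheese"}]))                     # 2
print(support([{"egg"}]))                        # 1
print(support([{"cheese"}, {"egg"}]))            # 1
print(support([{"cheese"}, {"milk"}]))           # 2
print(support([{"cheese"}, {"bread", "milk"}]))  # 1
```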

Mining Sequential Patterns

 Example database

CID  Time  Items
1    3/25  30
1    3/30  90
2    3/10  10
2    3/15  20,30
2    3/20  40,60,70
3    3/25  30,50,70
4    3/25  10,30
4    3/30  40,70
4    4/25  60,90
5    3/12  90

• Supports of the following sequences are:
  <{30}, {90}>: 2 (or 40%)
  <{30}, {40, 70}>: 2 (or 40%)
  <{10}, {40, 60}>: 1 (or 20%)
• Goal: to discover all sequences with a user-specified minimum support.

Mining Sequential Patterns

 Example: mine all frequent sequential patterns with minimum support = 40% (or 2 data-sequences).

CID  Time  Items
1    1     10,30
1    4     80
2    3     10
2    7     30,40
2    20    60,70,80
3    2     30,50
3    8     70,80
4    2     70
4    10    80
5    1     10,20
5    10    30
5    28    70
5    31    80

L1 (frequent 1-sequences):
<{10}>:3  <{30}>:4  <{70}>:4  <{80}>:5

L2 (frequent 2-sequences):
<{10},{30}>:2  <{10},{70}>:2  <{10},{80}>:3  <{30},{70}>:3  <{30},{80}>:4  <{70},{80}>:2  <{70,80}>:2

L3 (frequent 3-sequences):
<{10},{30},{70}>:2  <{10},{30},{80}>:2  <{30},{70,80}>:2
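
The L1 and L2 counts can be reproduced by brute force; the sketch below enumerates 2-sequence candidates directly rather than using GSP's join-based candidate generation, so it is a check, not the algorithm itself:

```python
from itertools import combinations

db = {  # the example database, one ordered list of itemsets per customer
    1: [{10, 30}, {80}],
    2: [{10}, {30, 40}, {60, 70, 80}],
    3: [{30, 50}, {70, 80}],
    4: [{70}, {80}],
    5: [{10, 20}, {30}, {70}, {80}],
}

def contains(d, s):  # greedy subsequence test, as defined earlier
    i = 0
    for element in d:
        if i < len(s) and s[i] <= element:
            i += 1
    return i == len(s)

def sup(s):
    return sum(contains(d, s) for d in db.values())

# L1: frequent items (min_sup = 2)
items = sorted({i for d in db.values() for e in d for i in e})
l1 = [i for i in items if sup([{i}]) >= 2]
print({i: sup([{i}]) for i in l1})  # {10: 3, 30: 4, 70: 4, 80: 5}

# L2: a 2-sequence is either two one-item elements <{x},{y}> or a single
# two-item element <{x,y}>; only frequent items need to be considered.
for x, y in combinations(l1, 2):
    for cand in ([{x}, {y}], [{y}, {x}], [{x, y}]):
        if sup(cand) >= 2:
            print(cand, sup(cand))
```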

Mining Sequential Patterns
 Sequential pattern mining on Weka
 Weka implemented GSP (with some limitations).
 Transactional database needs to be converted to an appropriate format as an arff file.
 One transaction per tuple
 All tuples, so all transactions, have the same number of attributes, and each item is a value of the corresponding attribute.

Mining Sequential Patterns

 Example [Weka screenshot of the input dataset]

Mining Sequential Patterns

 Result with min_sup = 90%

1-sequences
[1] <{revolution}> (3)
[2] <{civil war}> (3)
[3] <{anderson}> (3)
[4] <{football}> (3)

2-sequences
[1] <{revolution}{civil war}> (3)
[2] <{revolution}{anderson}> (3)
[3] <{civil war,anderson}> (3)

3-sequences
[1] <{revolution}{civil war,anderson}> (3)

Mining Sequential Patterns

 Result with min_sup = 60%

1-sequences
[1] <{revolution}> (3)
[2] <{civil war}> (3)
[3] <{anderson}> (3)
[4] <{baseball}> (2)
[5] <{football}> (3)

2-sequences
[1] <{revolution}{civil war}> (3)
[2] <{revolution}{anderson}> (3)
[3] <{revolution}{football}> (2)
[4] <{civil war,anderson}> (3)
[5] <{revolution,baseball}> (2)
[6] <{baseball}{civil war}> (2)
[7] <{baseball}{anderson}> (2)
[8] <{baseball}{football}> (2)
[9] <{civil war,football}> (2)
[10] <{anderson,football}> (2)

3-sequences
[1] <{revolution}{civil war,anderson}> (3)
[2] <{revolution}{civil war,football}> (2)
[3] <{revolution}{anderson,football}> (2)
[4] <{civil war,anderson,football}> (2)
[5] <{revolution,baseball}{civil war}> (2)
[6] <{revolution,baseball}{anderson}> (2)
[7] <{revolution,baseball}{football}> (2)
[8] <{baseball}{civil war,anderson}> (2)
[9] <{baseball}{civil war,football}> (2)
[10] <{baseball}{anderson,football}> (2)

4-sequences
[1] <{revolution}{civil war,anderson,football}> (2)
[2] <{revolution,baseball}{civil war,anderson}> (2)
[3] <{revolution,baseball}{civil war,football}> (2)
[4] <{revolution,baseball}{anderson,football}> (2)
[5] <{baseball}{civil war,anderson,football}> (2)

5-sequences
[1] <{revolution,baseball}{civil war,anderson,football}> (2)

Associative Classification
 Each sample (or tuple) is considered as a transaction.
 An (attribute, value) pair is an item.
 Frequent itemsets are mined using an association rule mining algorithm.
 Strong rules are mined from the frequent itemsets, which satisfy the minimum support and minimum confidence thresholds.
 We only use rules that have the class attribute in the consequent.
 Rules are organized to form a rule-based classifier.

Associative Classification

 Example dataset

outlook   temperature  humidity  windy  play
sunny     hot          high      F      N
sunny     hot          high      T      N
overcast  hot          high      F      Y
rainy     mild         high      F      Y
rainy     cool         normal    F      Y
rainy     cool         normal    T      N
overcast  cool         normal    T      Y
sunny     mild         high      F      N
sunny     cool         normal    F      Y
rainy     mild         normal    F      Y
sunny     mild         normal    T      Y
overcast  mild         high      T      Y
overcast  hot          normal    F      Y
rainy     mild         high      T      N

Associative Classification

 For association rule mining, each tuple is considered as a transaction and each (attribute, value) pair becomes an item as follows:

TID  items
1    outlook=sunny, temperature=hot, humidity=high, windy=F, play=N
2    outlook=sunny, temperature=hot, humidity=high, windy=T, play=N
3    outlook=overcast, temperature=hot, humidity=high, windy=F, play=Y
4    outlook=rainy, temperature=mild, humidity=high, windy=F, play=Y
5    outlook=rainy, temperature=cool, humidity=normal, windy=F, play=Y
6    ...
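
A rough equivalent outside Weka can be sketched with the mlxtend library (an assumption on my part; thresholds are illustrative and only the first three tuples are spelled out):

```python
# Mine frequent itemsets over the attribute=value transactions, then keep
# only class-association rules (consequent = the class attribute, play).
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

rows = [
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "F", "play": "N"},
    {"outlook": "sunny", "temperature": "hot", "humidity": "high", "windy": "T", "play": "N"},
    {"outlook": "overcast", "temperature": "hot", "humidity": "high", "windy": "F", "play": "Y"},
    # ... remaining 11 tuples
]
transactions = [[f"{a}={v}" for a, v in r.items()] for r in rows]

te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

freq = apriori(onehot, min_support=0.2, use_colnames=True)
rules = association_rules(freq, metric="confidence", min_threshold=0.9)

# Keep only rules whose consequent is the class attribute.
car = rules[rules["consequents"].apply(
    lambda c: len(c) == 1 and next(iter(c)).startswith("play="))]
print(car[["antecedents", "consequents", "support", "confidence"]])
```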

Associative Classification

 On Weka:
 Set car to True.
 Specify the index of the class attribute.

Associative Classification

 Result: top ten rules by confidence [Weka output screenshot]

References

• Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012. http://www.cs.illinois.edu/~hanj/bk3/
• Multilevel association rule mining: Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012, pp. 283–287.
• Associative classification: Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd Ed., Morgan Kaufmann, 2012, pp. 416–419.
• Sequential pattern mining: R. Srikant and R. Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements," Proc. 5th Int'l Conf. on Extending Database Technology: Advances in Database Technology, 1996, pp. 3–17.