
STAT318 — Data Mining
University of Canterbury, Christchurch

Association Analysis
Association analysis is an unsupervised learning technique that finds interesting associations in large data sets.
It is often applied to large commercial databases, where it is useful for selective marketing.
If the data are binary-valued (asymmetric), the problem is called market basket analysis.
Let I = {i1, i2, ..., ip} be a set of p items and let T = {t1, t2, ..., tN} be a set of N transactions such that tj ⊆ I for j = 1, 2, ..., N.
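Concretely, here is a minimal Python sketch of this setup (an illustration, not part of the slides), using the five transactions from the worked example later in the section:

```python
# Item universe I and transaction set T (the worked example's market basket data)
I = {"Bread", "Milk", "Diapers", "Beer", "Cola", "Eggs"}
T = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diapers", "Beer", "Eggs"}),
    frozenset({"Milk", "Diapers", "Beer", "Cola"}),
    frozenset({"Bread", "Milk", "Diapers", "Beer"}),
    frozenset({"Bread", "Milk", "Diapers", "Cola"}),
]
assert all(t <= I for t in T)  # every transaction is a subset of I
```

Here p = 6 items and N = 5 transactions.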

Terminology
Itemset: A collection X of items from I. If X contains k items, it is called a k-itemset.
Support Count: The number of transactions that contain X as a subset: σ(X) = |{tj : X ⊆ tj and tj ∈ T}|.
Support: The fraction of transactions that contain X as a subset: s(X) = σ(X)/N.
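In code, these two definitions read as follows (a direct sketch, reusing T from the earlier snippet):

```python
def support_count(X, T):
    """sigma(X): number of transactions containing X as a subset."""
    return sum(1 for t in T if X <= t)  # X <= t is the subset test

def support(X, T):
    """s(X) = sigma(X) / N."""
    return support_count(X, T) / len(T)

print(support(frozenset({"Bread", "Milk"}), T))  # 3/5 = 0.6
```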
Frequent Itemset: An itemset whose support is greater than or equal to a threshold minsup, where 0 < minsup < 1.

Rule Evaluation Metrics
Association Rule: An association rule is an implication expression of the form X → Y, where X, Y ⊂ I and X ∩ Y = ∅.
Support: The fraction of transactions containing both X and Y as a subset:
s(X → Y) = σ(X ∪ Y)/N.
Confidence: How frequently items in Y appear in transactions that also contain X:
c(X → Y) = σ(X ∪ Y)/σ(X).

Association Analysis
Example: Consider the following transactions:

ID   Items
1    {Bread, Milk}
2    {Bread, Diapers, Beer, Eggs}
3    {Milk, Diapers, Beer, Cola}
4    {Bread, Milk, Diapers, Beer}
5    {Bread, Milk, Diapers, Cola}

(a) Find the support of X = {Bread, Milk, Diapers}.
(b) Suppose minsup = 4/5. Find all frequent itemsets.
(c) Find the support and confidence of the rule {Bread, Diapers} → {Beer}.

Association Rule Mining
The goal of association rule mining is to find strong rules:
s(X → Y) ≥ minsup and c(X → Y) ≥ minconf.
A brute-force approach to finding strong rules is computationally prohibitive because there are 3^p − 2^(p+1) + 1 possible rules. In the previous example there are p = 6 items, giving 602 association rules.

Association Rule Mining
One way to reduce the computational effort is to decouple the support and confidence conditions.
Example: The following association rules all have the same support:
{Bread, Diapers} → {Beer},  {Diapers, Beer} → {Bread},
{Beer} → {Bread, Diapers},  {Bread, Beer} → {Diapers},
{Bread} → {Diapers, Beer},  {Diapers} → {Bread, Beer}.
Association rule mining can be viewed as a two-step process:
1 Frequent itemset generation;
2 Association rule generation.

Frequent Itemset Generation: Itemset Lattice
[Figure: the itemset lattice for I = {a, b, c, d, e}, from the empty set (null) at the top, through all 1-, 2-, 3- and 4-itemsets, down to {a, b, c, d, e} at the bottom.]

Frequent Itemset Generation
There are 2^p − 1 candidate itemsets to consider and hence it is computationally prohibitive to test all of them. The number of candidate itemsets can be dramatically reduced using the Apriori principle.
Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent.
Corollary: If an itemset is infrequent, then all of its supersets must also be infrequent.

Apriori Principle: Let {abc} be frequent
[Figure: the itemset lattice with {a, b, c} and all of its subsets highlighted as frequent.]

Apriori Principle: Let {ae} be infrequent
[Figure: the itemset lattice with {a, e} and all of its supersets pruned as infrequent.]

Apriori Algorithm (1995)
1 Initialize: Set k = 1 and find all frequent items (1-itemsets).
2 Repeat until no new frequent itemsets are found:
(a) Generate candidate (k + 1)-itemsets from frequent k-itemsets.
(b) Eliminate candidate itemsets containing infrequent k-itemsets.
(c) Calculate the support of each candidate.
(d) Eliminate infrequent (k + 1)-itemsets and increment k.
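A compact Python sketch of the algorithm above (illustrative, not optimized), using the Fk−1 × Fk−1 candidate generation described on the next slide:

```python
from itertools import combinations

def apriori(T, minsup):
    """Return {frequent itemset: support count}. T: list of frozensets."""
    N = len(T)
    def freq(candidates):
        counts = {c: sum(1 for t in T if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n / N >= minsup}
    # k = 1: find all frequent items
    items = {frozenset({i}) for t in T for i in t}
    frequent = freq(items)
    all_frequent = dict(frequent)
    k = 1
    while frequent:
        # Step 2(a): merge pairs of frequent k-itemsets that agree on
        # their first k - 1 items (the F_{k-1} x F_{k-1} method)
        prev = sorted(tuple(sorted(f)) for f in frequent)
        candidates = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                c = frozenset(a) | frozenset(b)
                # Step 2(b): prune candidates with an infrequent k-subset
                if all(frozenset(s) in frequent for s in combinations(c, k)):
                    candidates.add(c)
        # Steps 2(c)-(d): count support, keep the frequent ones
        frequent = freq(candidates)
        all_frequent.update(frequent)
        k += 1
    return all_frequent
```

Running apriori(T, 3/5) on the example transactions reproduces the counts in the worked Apriori example later in the section (e.g. the single candidate 3-itemset {Bread, Diapers, Milk} with count 2, which is then eliminated).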
Apriori Algorithm: Step 2(a)
We need a complete (guarantees every frequent itemset is generated) and efficient (no duplicates and few infrequent candidate itemsets) way of generating candidate itemsets.
Fk−1 × Fk−1 Method:
1 Place the k-itemsets in lexicographic order (xi ≤ xi+1):
X = (x1, x2, ..., xk),  Y = (y1, y2, ..., yk).
2 Two such itemsets are merged to form a candidate (k + 1)-itemset iff
xi = yi for i = 1, 2, ..., k − 1, and xk ≠ yk.

Fk−1 × Fk−1 Method: Example
Example: Let I = {a, b, c, d, e} and suppose that there are five frequent 3-itemsets:
{a,b,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d}.
(a) Use the Fk−1 × Fk−1 method to generate all possible candidate 4-itemsets.
(b) Are any of the candidate 4-itemsets infrequent?

Apriori Algorithm: Example
An illustration of the Apriori algorithm using the transactions from the earlier example with minsup = 3/5.

Candidate 1-itemsets (support counts):
Beer 3, Bread 4, Cola 2, Diapers 4, Milk 4, Eggs 1

Candidate 2-itemsets:
{Beer, Bread} 2, {Beer, Diapers} 3, {Beer, Milk} 2,
{Bread, Diapers} 3, {Bread, Milk} 3, {Diapers, Milk} 3

Candidate 3-itemsets:
{Bread, Diapers, Milk} 2

The brute-force approach generates 41 candidate itemsets, whereas the Apriori algorithm generates 13.

Maximal Frequent Itemsets
The number of frequent itemsets found by the Apriori algorithm is usually huge, especially when minsup is low. Example: if a frequent 100-itemset is found, there are ≈ 1.27 × 10^30 frequent sub-itemsets!
Maximal Frequent Itemset: a frequent itemset for which none of its proper (immediate) supersets is frequent.
Maximal frequent itemsets provide a compact representation of frequent itemsets, but they contain no information about the support counts of their subsets (all the frequent itemsets).

Maximal Frequent Itemsets
[Figure: the itemset lattice for {a, b, c, d, e} with the maximal frequent itemsets marked along the boundary between the frequent and infrequent itemsets.]

Closed Frequent Itemsets
Closed Itemset: an itemset X is closed if none of its proper (immediate) supersets has exactly the same support count as X.
Closed Frequent Itemset: a closed itemset with support greater than or equal to minsup.
Example: Using minsup = 2/5, find the closed frequent itemsets in:

ID   Items
1    {a, b, c}
2    {a, b, c, d}
3    {b, c, e}
4    {a, c, d, e}
5    {d, e}

Closed Frequent Itemsets
[Figure: the itemset lattice for the five transactions above, annotated with support counts, e.g. σ(a) = 3, σ(b) = 3, σ(c) = 4, σ(d) = 3, σ(e) = 3, σ(ab) = 2, σ(ac) = 3, σ(abc) = 2, ...]

Rule Generation
Each frequent k-itemset Y has 2^k association rules of the form X → Y \ X, where X ⊆ Y.
Example: Let Y = {a, b, e} be a frequent itemset; then the possible association rules are:
∅ → {a,b,e},  {a,b} → {e},  {a,e} → {b},  {b,e} → {a},
{a} → {b,e},  {b} → {a,e},  {e} → {a,b},  {a,b,e} → ∅.
The goal is to find strong rules:
c(X → Y \ X) = s(Y)/s(X) ≥ minconf.

Confidence-Based Pruning
It is not necessary to calculate the confidence of every possible rule from a frequent itemset. Let Y be a frequent itemset with X ⊂ Y. If the rule X → Y \ X fails the minconf threshold, then every rule X′ → Y \ X′ with X′ ⊂ X must also fail it, because s(X′) ≥ s(X) implies
c(X′ → Y \ X′) = s(Y)/s(X′) ≤ s(Y)/s(X) < minconf.
[Figure: the lattice of rules generated from the frequent itemset {a, b, c, d}, from rules with 3-item antecedents (bcd → a, acd → b, abd → c, abc → d) through 2-item antecedents (cd → ab, bd → ac, bc → ad, ad → bc, ac → bd, ab → cd) down to 1-item antecedents (d → abc, c → abd, b → acd, a → bcd); when a rule fails minconf, every rule below it with a smaller antecedent is pruned.]
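A sketch of rule generation with this pruning, building on the apriori() output from the earlier sketch (an illustration under the same assumptions, not library code):

```python
from itertools import combinations

def generate_rules(all_frequent, minconf):
    """Strong rules X -> Y\\X from apriori() output {itemset: support count}.
    Trivial rules with an empty antecedent or consequent are skipped."""
    rules = []
    for Y, sigma_Y in all_frequent.items():
        if len(Y) < 2:
            continue
        failed = []  # consequents whose rules failed minconf
        # antecedents from largest to smallest, so consequents only grow
        for size in range(len(Y) - 1, 0, -1):
            for X in map(frozenset, combinations(Y, size)):
                C = Y - X
                # confidence-based pruning: any superset of a failed
                # consequent must also fail, since c = s(Y)/s(X) only shrinks
                if any(f <= C for f in failed):
                    continue
                conf = sigma_Y / all_frequent[X]  # X is frequent by Apriori
                if conf >= minconf:
                    rules.append((set(X), set(C), conf))
                else:
                    failed.append(C)
    return rules
```

For instance, generate_rules(apriori(T, 3/5), 0.7) on the example transactions includes {Beer} → {Diapers} with confidence 3/3 = 1.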

Strong rules are not necessarily interesting
Example: Consider the following association rule and contingency table:
{Tea} → {Coffee}   (support = 15%, confidence = 75%)

             Tea   Not Tea   Total
Coffee       150     650      800
Not Coffee    50     150      200
Total        200     800     1000

(a) Pr(Coffee) = 800/1000 = 0.8
(b) Pr(Coffee | Tea) = 150/200 = 0.75

The rule's confidence (75%) is lower than the baseline probability of buying coffee (80%): knowing that a customer buys tea actually lowers the chance that they buy coffee.
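The arithmetic, as a quick check (the variable names are ours; the counts come from the table above):

```python
# Contingency counts for {Tea} -> {Coffee}
N = 1000
n_tea, n_coffee, n_tea_coffee = 200, 800, 150

support = n_tea_coffee / N         # 0.15
confidence = n_tea_coffee / n_tea  # 0.75
pr_coffee = n_coffee / N           # 0.80
print(confidence < pr_coffee)      # True: tea buyers buy coffee less often
```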

Objective Interestingness Measures
Lift is an interestingness measure based on correlation.
Lift: Given two itemsets X and Y, the lift measure is defined as
lift(X, Y) = s(X ∪ Y) / (s(X) s(Y)) = c(X → Y) / s(Y).
Interpretation:
lift(X, Y) = 1: X and Y are statistically independent;
lift(X, Y) > 1: X and Y are positively correlated;
lift(X, Y) < 1: X and Y are negatively correlated.
Example: Find lift(Tea, Coffee).

Objective Interestingness Measures
Cosine: Given two itemsets X and Y, the cosine measure is defined by
cosine(X, Y) = s(X ∪ Y) / √(s(X) s(Y)) = √(c(X → Y) c(Y → X)).
A cosine value close to one indicates that most of the transactions containing X also contain Y, and vice versa. A value close to zero means most transactions containing X do not contain Y, and vice versa.

Objective Interestingness Measures
Symmetric Measure: a measure M is symmetric iff M(X → Y) = M(Y → X).
Asymmetric Measure: a measure that is not symmetric.
Null Transaction: a transaction that does not include any of the itemsets being examined.
Null Invariant: a measure is null invariant if its value does not depend on the number of null transactions.

Objective Interestingness Measures
Example: Is lift a symmetric measure? Is it null invariant?
Example: Is cosine a symmetric measure? Is it null invariant?

Summary
Association rule mining consists of finding frequent itemsets, from which strong association rules are formed.
The Apriori algorithm is a seminal method for association analysis.
Association rules are among data mining's biggest successes (Hastie, Tibshirani and Friedman).