
STAT318 — Data Mining
Association Analysis
Jiří Moravec
University of Canterbury, Christchurch



Motivation
You are working in the marketing department of a supermarket. Consider the following shopping lists:
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
What can you learn about your customers?
Which items should be restocked together?
Which items should be marketed to which customers?


Association Analysis
Association analysis is an unsupervised learning technique that finds interesting associations in large datasets.
It is often applied to large commercial databases, where it is useful for selective marketing.
It can be interpreted as “mode finding” or “bump hunting”: finding parts of the joint probability space that have larger probability than expected.

Association Analysis
Market Basket Analysis: Identify items frequently purchased together, understand shopping habits, do selective marketing and plan shelf space.
Credit card risk analysis: Finding characteristics of customers that are likely to default.
Medical data: Identify relationships between symptoms, test results and illness.

An Introduction to Statistical Learning does not have a chapter on association analysis.
The Elements of Statistical Learning has a short theoretical summary.
We will use Introduction to Data Mining (2019).
The relevant chapter (Chapter 5: Association Analysis: Basic Concepts and Algorithms) is available on the authors’ website: https://www-users.cse.umn.edu/~kumar001/dmbook


Theoretical introduction
The goal of association analysis is to find a collection of prototype X-values v1, …, vL such that Pr(X = vℓ) is large for each ℓ = 1, …, L.
This is in essence “mode finding” or “bump hunting”.
This is an impossibly difficult problem for anything but the simplest probability densities.
No need to remember, just a demonstration

Theoretical introduction
Simplifying assumption: instead of looking for values x where Pr(x) is large, look at regions of the X-space.
Let Sj be the set of all values of variable (feature) Xj (the support of Xj). Find subsets s1, …, sp, sj ⊆ Sj, such that the conjunctive rule
Pr( ⋂_{j=1}^p (Xj ∈ sj) )
is relatively large.
When p and N are small this is solvable, but it is intractable for large databases, so further simplifications are required.

Theoretical introduction – Market Basket Analysis
Instead of considering a region sj ⊆ Sj, consider only a single value sj = v0j or the entire set sj = Sj (in which case Xj is said not to appear in the rule).
The goal is now to find a subset of integers J ⊂ {1, …, p} and corresponding values v0j, j ∈ J, such that
Pr( ⋂_{j∈J} (Xj = v0j) )
is large.
We also recode the variables using the dummy variable technique, so that Zk = 1 if the corresponding variable takes its particular value and Zk = 0 otherwise.

Theoretical introduction – Market Basket Analysis
The goal is now to find a subset 𝒦 ⊂ {1, …, K} of the dummy-variable indices such that
Pr( ⋂_{k∈𝒦} (Zk = 1) )
is large. This is now easily estimated:
Pr( ⋂_{k∈𝒦} (Zk = 1) ) = (1/N) Σ_{i=1}^N ∏_{k∈𝒦} zik,
where zik is the value of dummy variable Zk for the i-th case (transaction).
We went from something intractable, to something complex, to something we can easily calculate by hand.
It can therefore also be calculated easily on a big database by computer.

Theoretical introduction – Example
Consider the following shopping lists:
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
What is the probability of {Bread, Milk}?
Transform into binary representation.
Perform the calculation according to the equation (1/N) Σ_{i=1}^N ∏_{k∈𝒦} zik,
which is equivalent to just counting the occurrences of {Bread, Milk}.
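As a concrete illustration of this calculation, here is a minimal Python sketch (not part of the original slides; variable names are illustrative). It builds the binary dummy-variable representation of the five transactions and estimates the support of {Bread, Milk} as (1/N) Σ_i ∏_k zik:

```python
from itertools import chain

# The five transactions from the motivation slide.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
items = sorted(set(chain.from_iterable(transactions)))
N = len(transactions)

# Dummy-variable (binary) representation: z[i][k] = 1 if item k occurs in transaction i.
z = [[int(item in t) for item in items] for t in transactions]

def support(itemset):
    """(1/N) * sum_i prod_{k in itemset} z_ik: the fraction of transactions containing `itemset`."""
    cols = [items.index(item) for item in itemset]
    return sum(all(row[k] == 1 for k in cols) for row in z) / N

print(support({"Bread", "Milk"}))  # 0.6 (= 3/5)
```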

Terminology
Let I = {i1, …, ip} be a set of p items.
Let T = {t1, …, tN} be a set of N transactions such that tj ⊆ I for j = 1, …, N.
An itemset X is a collection of items from I. If X has k items, it is called a k-itemset.
The support count is the number of transactions that have X as a subset: σ(X) = |{tj : X ⊆ tj and tj ∈ T}|.
The support is the fraction of transactions that have X as a subset: s(X) = σ(X)/N = Pr(X).
A frequent itemset is an itemset with support greater than or equal to some minsup threshold, 0 < minsup < 1.

Terminology
An association rule is an implication expression of the form X → Y such that X, Y ⊂ I and X ∩ Y = ∅.
For the rule X → Y, X is called the antecedent (or LHS) and Y is called the consequent (or RHS).
An association rule X → Y is not an implication (causality); it is interpreted as co-occurrence: transactions containing X also tend to contain Y.
Some of these associations might not be significant, some might be spurious and some might not be very interesting, so we need some way of describing the strength of these relationships.

Terminology
The strength of an association rule can be measured using support, confidence and lift:
Support of the association rule is the fraction of transactions containing both X and Y as a subset:
s(X → Y) = σ(X ∪ Y)/N = Pr(X ∩ Y) = Pr(X, Y).
Confidence of the association rule is how frequently items in Y appear in transactions that also contain X:
c(X → Y) = σ(X ∪ Y)/σ(X) = Pr(X, Y)/Pr(X) = Pr(Y | X).
Lift of the association rule measures whether the co-occurrence is interesting (more likely than expected):
lift(X, Y) = s(X ∪ Y)/(s(X) s(Y)) = c(X → Y)/s(Y) = Pr(X, Y)/(Pr(X) Pr(Y)).
Note that if X and Y are independent, then Pr(X, Y) = Pr(X) Pr(Y); lift is thus a measure of dependence.
Low support means that the rule does not occur very frequently. Confidence is a measure of reliability: how often the rule holds (is true). High support with low confidence, as well as low support with high confidence, are both of low importance. Note again that the rules are not causal relationships.

Example: Consider the following transactions
1 {Bread, Milk}
2 {Bread, Diapers, Beer, Eggs}
3 {Milk, Diapers, Beer, Cola}
4 {Bread, Milk, Diapers, Beer}
5 {Bread, Milk, Diapers, Cola}
(a) Find the support of X = {Bread, Milk, Diapers}.
(b) Suppose minsup = 4/5. Find all frequent itemsets.
(c) Find the support and confidence of the rule {Bread, Diapers} → {Beer}.
(a) s(X) = 2/5
(b) {Bread}, {Milk}, {Diapers}
(c) s({Bread, Diapers} → {Beer}) = 2/5
c({Bread, Diapers} → {Beer}) = σ({Bread, Diapers, Beer}) / σ({Bread, Diapers}) = 2/3
“When Bread and Diapers were purchased together, 66% of the time Beer was also purchased.”

Association Rule Mining
The goal of association rule mining is to find strong rules:
s(X → Y) ≥ minsup and c(X → Y) ≥ minconf.
A brute-force approach to finding strong rules is computationally prohibitive because there are 3^p − 2^(p+1) + 1 possible rules. In the previous example there are p = 6 items, giving 602 association rules.
It is not uncommon to have p > 10^4, and so the number of strong rules can be extremely large.
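The support, confidence and lift definitions translate directly into code. The following Python sketch (an illustration, not part of the course materials; function names are mine) verifies the worked example for the rule {Bread, Diapers} → {Beer} and the rule-count formula 3^p − 2^(p+1) + 1:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
N = len(transactions)

def support(itemset):
    """s(itemset): fraction of transactions containing `itemset` as a subset."""
    return sum(itemset <= t for t in transactions) / N

def confidence(lhs, rhs):
    """c(lhs -> rhs) = s(lhs ∪ rhs) / s(lhs)."""
    return support(lhs | rhs) / support(lhs)

def lift(lhs, rhs):
    """lift(lhs, rhs) = s(lhs ∪ rhs) / (s(lhs) * s(rhs))."""
    return support(lhs | rhs) / (support(lhs) * support(rhs))

lhs, rhs = {"Bread", "Diapers"}, {"Beer"}
print(support(lhs | rhs))    # 0.4      (= 2/5)
print(confidence(lhs, rhs))  # 0.666... (= 2/3)
print(lift(lhs, rhs))        # 1.111... (= (2/5) / ((3/5)(3/5)) = 10/9)

p = 6
print(3**p - 2**(p + 1) + 1)  # 602 possible rules for p = 6 items
```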

Association Rule Mining
One way to reduce computational effort is to decouple the support and confidence conditions (i.e., find rules with high support and then remove low-confidence ones).
Example: The following association rules have the same support.
{Bread, Diapers} → {Beer}, {Diapers, Beer} → {Bread}, {Beer} → {Bread, Diapers},
{Bread, Beer} → {Diapers}, {Bread} → {Diapers, Beer}, {Diapers} → {Bread, Beer}.
Association rule mining can be viewed as a two-step process:
1 Frequent itemset generation;
2 Association rule generation.
These rules are binary partitions of the itemset {Bread, Diapers, Beer}; therefore, they all have the same support. If this itemset is known to be infrequent, all six rules can be pruned immediately (their confidence values will not necessarily be the same).
An efficient way to find strong rules is therefore to first find all the frequent itemsets and then, for each frequent itemset X, compute the confidences of the rules formed from binary partitions of X.
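As a sketch of the second step (rule generation), the following Python snippet enumerates the binary partitions of a frequent itemset and keeps the rules whose confidence reaches a given minconf. The function name and the minconf value are illustrative choices, not part of the slides:

```python
from itertools import combinations

transactions = [frozenset(t) for t in [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]]
N = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / N

def rules_from_itemset(itemset, minconf):
    """All rules lhs -> rhs from binary partitions of `itemset` with confidence >= minconf."""
    itemset = frozenset(itemset)
    strong = []
    for r in range(1, len(itemset)):                  # every non-empty proper subset is a possible LHS
        for lhs in combinations(sorted(itemset), r):
            lhs = frozenset(lhs)
            conf = support(itemset) / support(lhs)    # c(lhs -> rhs) with rhs = itemset \ lhs
            if conf >= minconf:
                strong.append((set(lhs), set(itemset - lhs), conf))
    return strong

for lhs, rhs, conf in rules_from_itemset({"Bread", "Diapers", "Beer"}, minconf=0.6):
    print(lhs, "->", rhs, round(conf, 2))
```

Note that all rules generated from the same itemset share one support value (here 2/5); only their confidences differ.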

Frequent Itemset Generation: Itemset Lattice
[Figure: the itemset lattice for I = {a, b, c, d, e}, from the empty set up to {a,b,c,d,e}.]
This is the itemset lattice for I = {a,b,c,d,e}, which is useful for illustrating the basic concepts of association analysis.

Frequent Itemset Generation
There are 2^p − 1 candidate itemsets to consider and hence it is computationally prohibitive to test all of them.
The number of candidate itemsets can be dramatically reduced using the Apriori principle.
Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent.
Corollary: If an itemset is infrequent, then all of its super sets must also be infrequent.

Apriori Principle: Let {abc} be frequent
[Figure: itemset lattice for I = {a, b, c, d, e} with the power set of {a,b,c} highlighted as frequent.]
If {a,b,c} is frequent, then
∅, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}
are also frequent. This is called the power set of {a,b,c}.

Apriori Principle: Let {ae} be infrequent
[Figure: itemset lattice with the supersets of {a,e} crossed out as infrequent.]
Support based pruning: If X is infrequent, then all the super sets of X are infrequent. The super sets of {a,e} (infrequent) are
{a,e}, {a,b,e}, {a,c,e}, {a,d,e}, {a,b,c,e}, {a,b,d,e}, {a,c,d,e}, {a,b,c,d,e}
Because {a,e} is infrequent and it’s a subset of each itemset, these itemsets are also infrequent (we do not need to compute their support to establish this).
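A small Python sketch of support-based pruning (illustrative only): it enumerates the supersets of the infrequent itemset {a,e} within the lattice over I = {a, b, c, d, e}, all of which can be discarded without counting their support:

```python
from itertools import combinations

items = ["a", "b", "c", "d", "e"]
infrequent = {"a", "e"}

# Every itemset containing the infrequent itemset {a, e} can be pruned
# without counting its support (corollary of the Apriori principle).
pruned = [set(c)
          for k in range(len(infrequent), len(items) + 1)
          for c in combinations(items, k)
          if infrequent <= set(c)]

for s in pruned:
    print(sorted(s))
# prints the 8 itemsets listed above: {a,e}, {a,b,e}, {a,c,e}, {a,d,e},
# {a,b,c,e}, {a,b,d,e}, {a,c,d,e}, {a,b,c,d,e}
```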

Apriori Algorithm (1995)
1 Initialize: Set k = 1 and find all frequent items (1-itemsets).
2 Repeat until no new frequent itemsets are found:
(a) Generate candidate (k + 1)-itemsets from frequent k-itemsets.
(b) Eliminate candidate itemsets containing infrequent k-itemsets.
(c) Calculate the support of each candidate.
(d) Eliminate infrequent (k + 1)-itemsets and increment k.
The Apriori algorithm only requires one pass over the data for each value of k. This is important because we assume that the set of transactions cannot fit in our computer’s main memory. The output of the entire analysis is a collection of strong rules. These are usually stored in a data base that can be queried by the user. For example, finding strong rules with particular items in the consequent.
The support threshold (minsup) should not be set too small because the amount of computation (and the number of passes over the data) can grow dramatically with decreasing minsup. One strategy is to consider a sequence of decreasing values and stop when the output or the computational burden becomes too large. (You will try different minsup and minconf values in the lab.)

Apriori Algorithm: Generating candidates
We need a complete (guarantee every frequent itemset is generated) and efficient (no duplicates and few infrequent candidate itemsets) way of generating candidate itemsets.
We can’t just generate the full lattice, it is to inefficient
Possible methods:
Brute-Force Method (too inefficient, but just for completeness)
Fk−1 × F1 Method
Fk−1 × Fk−1 Method

Generating candidates – Brute-Force Method
1 Consider every k-itemset as a potential candidate.
2 Prune according to the Apriori principle.
Disadvantages:
Generating candidates is easy, but pruning is expensive.
The number of candidate k-itemsets generated is the binomial coefficient (d choose k), where d is the number of items.

Fk−1 × F1 Method
Extend each frequent (k − 1)-itemset with frequent items (1-itemsets) that are not part of the (k − 1)-itemset.
Example: Let I = {a, b, c, d, e} and suppose that there are five frequent 3-itemsets
{a,b,c},{a,b,d},{a,c,d},{a,c,e},{b,c,d}.
(a) Use the Fk−1 × F1 method to generate all possible candidate 4-itemsets.
(b) Are any of the candidate 4-itemsets infrequent?
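A possible Python sketch of the plain Fk−1 × F1 method for this example is shown below. It assumes that all five single items a–e are frequent 1-itemsets (the slide does not list F1 explicitly) and makes the duplicate-generation problem discussed on the next slide visible:

```python
frequent_3 = [frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")]
frequent_1 = [frozenset(c) for c in "abcde"]   # assumption: every single item is frequent

# F_{k-1} x F_1: extend each frequent 3-itemset with every frequent item not already in it.
extensions = [s | i for s in frequent_3 for i in frequent_1 if not i <= s]

print(len(extensions))  # 10 extensions are generated ...
candidates = set(extensions)
print(sorted("".join(sorted(c)) for c in candidates))
# ... but only 5 distinct candidates: ['abcd', 'abce', 'abde', 'acde', 'bcde']
# ({a,b,c,d} alone is generated four times)
```

Regarding (b): only {a,b,c,d} has all of its 3-item subsets among the listed frequent 3-itemsets, so the other four candidates can be declared infrequent by the Apriori principle.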

Fk−1 × F1 Method (continued)
An improvement over the brute-force method, but many candidates generated this way can still be infrequent.
The Fk−1 × F1 method can also generate the same candidate multiple times, using different combinations of itemsets.
Remedy: extend the Fk−1 × F1 method using lexicographic ordering.

Lexicographic ordering
Sort the items in each itemset according to their alphabetic order (the particular order does not matter, as long as it is unambiguous).
Extend a (k − 1)-itemset only with items that are lexicographically larger than its last element.
{c,d} can be augmented with {e}, but not with {a}, as {a} is lower in the lexicographic order.
This reduces the number of infrequent candidates being generated

Fk−1 × F1 Method with lexicographic ordering
Sort items in itemsets according to their lexicographic order.
Extend (k − 1)-itemset with frequent items (1-itemsets) that are not part of the (k − 1)-itemset and are lexicographically larger.
Example: Let I = {a, b, c, d, e} and suppose that there are five frequent 3-itemsets {a,b,c},{a,b,d},{a,c,d},{a,c,e},{b,c,d}.
(a) Use the modified Fk−1 × F1 method to generate all possible candidate 4-itemsets.
(b) Are any of the candidate 4-itemsets infrequent?
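Under the same assumptions as before (all single items frequent; names illustrative), a sketch of the lexicographically ordered Fk−1 × F1 method: each sorted 3-itemset is extended only with items larger than its last element, so every candidate is generated exactly once:

```python
frequent_3 = [tuple(sorted(s)) for s in ("abc", "abd", "acd", "ace", "bcd")]
items = list("abcde")   # frequent items, kept in lexicographic order (assumption)

# Extend each sorted 3-itemset only with items lexicographically larger than its last element.
candidates = [s + (i,) for s in frequent_3 for i in items if i > s[-1]]

print(["".join(c) for c in candidates])
# ['abcd', 'abce', 'abde', 'acde', 'bcde'] -- five candidates, no duplicates
```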

Fk−1 × Fk−1 Method
Some of the itemsets generated by the Fk−1 × F1 method can still be infrequent.
The Fk−1 × Fk−1 method provides an improvement, because every candidate is formed by merging two frequent immediate subsets.
Fk−1 × Fk−1 Method:
1 Place the frequent k-itemsets in lexicographic order (xi ≤ xi+1):
X = (x1, x2, …, xk),
Y = (y1, y2, …, yk).
2 Two such itemsets X and Y are merged to form a candidate (k + 1)-itemset iff
xi = yi for i = 1, 2, …, k − 1, and xk ≠ yk.
Lexicographic order = dictionary order

Fk−1 × Fk−1 Method: Example
Example: Let I = {a, b, c, d, e} and suppose that there are five frequent 3-itemsets
{a,b,c},{a,b,d},{a,c,d},{a,c,e},{b,c,d}.
(a) Use the Fk−1 × Fk−1 method to generate all possible candidate 4-itemsets.
(b) Are any of the candidate 4-itemsets infrequent?
(a) Before we merge, the itemsets must be placed in lexicographic order (here they already are).
There are two possible merges:
{a,b,c}, {a,b,d} → {a,b,c,d}
{a,c,d}, {a,c,e} → {a,c,d,e}
(b) {a,c,d,e} must be infrequent because {a,d,e} was not listed as a frequent 3-itemset. We would need to scan the set of transactions to determine if {a,b,c,d} is frequent.
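The merge condition translates into a few lines of Python; this sketch (illustrative, not from the slides) reproduces the two merges listed above:

```python
frequent_3 = [tuple(sorted(s)) for s in ("abc", "abd", "acd", "ace", "bcd")]

# F_{k-1} x F_{k-1}: merge two sorted frequent k-itemsets iff they agree on the
# first k-1 items and differ in the last one.
candidates = []
for i, x in enumerate(frequent_3):
    for y in frequent_3[i + 1:]:
        if x[:-1] == y[:-1] and x[-1] != y[-1]:
            candidates.append(x + (y[-1],))

print(["".join(c) for c in candidates])  # ['abcd', 'acde']
```

As the note above says, the candidate {a,c,d,e} can then be pruned because its subset {a,d,e} is not among the frequent 3-itemsets.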

Apriori Algorithm: Example
An illustration of the Apriori algorithm using the transactions from the motivation example, with minsup = 3/5.
Candidate 1-itemsets (with support counts):
Beer 3, Bread 4, Cola 2, Diapers 4, Eggs 1, Milk 4
Candidate 2-itemsets:
{Beer, Bread} 2, {Beer, Diapers} 3, {Beer, Milk} 2, {Bread, Diapers} 3, {Bread, Milk} 3, {Diapers, Milk} 3
Candidate 3-itemsets:
{Bread, Diapers, Milk} 2
The brute force approach generates 41 candidate itemsets, whereas the Apriori algorithm generates 13.
There are many tricks to speed up the algorithm (for example using hash trees for support counting), but these are not covered here because they do not change the basic structure of the algorithm.
The Apriori algorithm represents one of the major advances in data mining technology (The Elements of Statistical Learning, Hastie et al.).
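For completeness, here is a compact Python sketch of the whole Apriori algorithm (candidate generation by the Fk−1 × Fk−1-style merge, Apriori pruning, support counting). It is only an illustration: unlike the setting described in the notes above, this toy version keeps the transactions in memory. Run on the motivation transactions with minsup = 3/5, it reproduces the frequent itemsets implied by the table above:

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Minimal Apriori sketch: returns {frequent itemset: support count}."""
    N = len(transactions)
    min_count = minsup * N

    def counts(candidates):
        return {c: sum(c <= t for t in transactions) for c in candidates}

    # k = 1: find the frequent items.
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: s for c, s in counts(items).items() if s >= min_count}
    result = dict(frequent)

    k = 1
    while frequent:
        # (a) Candidate generation: merge sorted frequent k-itemsets that share
        #     their first k-1 items (the F_{k-1} x F_{k-1} idea, one level up).
        sets = sorted(tuple(sorted(f)) for f in frequent)
        candidates = {frozenset(x + (y[-1],))
                      for i, x in enumerate(sets) for y in sets[i + 1:]
                      if x[:-1] == y[:-1] and x[-1] != y[-1]}
        # (b) Candidate pruning: every k-item subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # (c) Support counting and (d) elimination of infrequent candidates.
        frequent = {c: s for c, s in counts(candidates).items() if s >= min_count}
        result.update(frequent)
        k += 1
    return result

transactions = [frozenset(t) for t in [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]]

for itemset, count in sorted(apriori(transactions, 3/5).items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```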

Maximal Frequent Itemsets
The number of frequent itemsets found by the Apriori algorithm is usually huge, especially when minsup is low.
Example: if a frequent 100-itemset is found, there are ≈ 1.27 × 10^30 frequent sub-itemsets!
Maximal Frequent Itemset: a frequent itemset for which none of its proper (immediate) super sets are frequent.
Maximal frequent itemsets provide a compact representation of frequent itemsets, but they contain no information about the support counts of their power sets (all frequent itemsets).

Maximal Frequent Itemsets
[Figure: itemset lattice with the infrequent itemsets crossed out; {a,d}, {a,c,e} and {b,c,d,e} are the maximal frequent itemsets.]
Let the crosses in the lattice denote infrequent itemsets. The un-crossed itemsets are then frequent, and the highlighted ones, {a,d}, {a,c,e} and {b,c,d,e}, are maximal frequent.
• All proper super sets of {a,d}, {a,c,e} and {b,c,d,e} are infrequent.
• The power sets of {a,d}, {a,c,e} and {b,c,d,e} are frequent.
• {a,c} is not maximal frequent because {a,c,e} is frequent and {a,c} ⊂ {a,c,e}.

Closed Frequent Itemsets
Closed Itemsets: an itemset X is closed if none of its proper (immediate) super sets has exactly the same support count as X.
Closed Frequent Itemsets: a closed itemset with support greater than or equal to minsup.
Example: Using minsup = 2/5, find the closed frequent itemsets in:
1 {a,b,c}
2 {a,b,c,d}
3 {b,c,e}
4 {a,c,d,e}
5 {d,e}
Closed itemsets put another way:
An itemset is not closed if at least one of its proper super sets has the same support.
Unlike the maximal frequent itemset representation, the closed itemset representation retains the support information.
If you know the closed itemsets, you can derive the support of all other itemsets in the lattice without having to go through the data.
There are still too many closed itemsets, and knowing all of them is superfluous; we only need a compact representation of the frequent itemsets.
The closed frequent itemsets are such a compact representation: using them, one can determine the support of all other (non-closed) frequent itemsets.

Closed Frequent Itemsets
Support counts for the itemset lattice:
a: 3, b: 3, c: 4, d: 3, e: 3
ab: 2, ac: 3, ad: 2, ae: 1, bc: 3, bd: 1, be: 1, cd: 2, ce: 2, de: 2
abc: 2, abd: 1, abe: 0, acd: 2, ace: 1, ade: 1, bcd: 1, bce: 1, bde: 0, cde: 1
abcd: 1, abce: 0, abde: 0, acde: 1, bcde: 0
The closed itemsets are {c}, {d}, {e}, {a,c}, {b,c}, {c,e}, {d,e}, {a,b,c}, {a,c,d}, {b,c,e}, {a,b,c,d} and {a,c,d,e}.
With minsup = 2/5 (support count ≥ 2), the closed frequent itemsets are {c}, {d}, {e}, {a,c}, {b,c}, {c,e}, {d,e}, {a,b,c} and {a,c,d}.
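The maximal and closed frequent itemsets can be checked with a short brute-force Python sketch (illustrative only; the first transaction {a,b,c} is the one reconstructed from the support counts above):

```python
from itertools import chain, combinations

transactions = [frozenset(t) for t in [
    {"a", "b", "c"},            # transaction 1, reconstructed from the support counts above
    {"a", "b", "c", "d"},
    {"b", "c", "e"},
    {"a", "c", "d", "e"},
    {"d", "e"},
]]
items = sorted(set(chain.from_iterable(transactions)))

def sigma(itemset):
    """Support count of an itemset."""
    return sum(itemset <= t for t in transactions)

# Support counts of every non-empty itemset over I = {a, b, c, d, e}.
support = {frozenset(c): sigma(frozenset(c))
           for k in range(1, len(items) + 1)
           for c in combinations(items, k)}

min_count = 2  # minsup = 2/5
frequent = {s for s, n in support.items() if n >= min_count}

# Maximal frequent: no proper superset is frequent.
maximal = {s for s in frequent if not any(s < f for f in frequent)}

# Closed: no immediate superset has the same support count.
closed = {s for s in support
          if not any(support[s | {i}] == support[s] for i in items if i not in s)}

fmt = lambda sets: sorted("".join(sorted(s)) for s in sets)
print(fmt(maximal))            # ['abc', 'acd', 'ce', 'de']
print(fmt(closed & frequent))  # ['abc', 'ac', 'acd', 'bc', 'c', 'ce', 'd', 'de', 'e']
```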
