STAT318 — Data Mining
Dr
University of Canterbury, Christchurch
Association Analysis
Association analysis is an unsupervised learning technique that finds interesting associations in large data sets.
It is often applied to large commercial databases, where it is useful for selective marketing.
If the data are binary-valued (asymmetric), the problem is called market basket analysis.
Let I = {i1, i2, …, ip} be a set of p items and let T = {t1, t2, …, tN} be a set of N transactions such that tj ⊆ I for j = 1, 2, …, N.
Asymmetric means 1 is important and 0 is not (we are interested in the 1’s). Examples:
• Market Basket Analysis: Identify items frequently purchased together, understand shopping habits, do selective marketing and plan shelf space.
• Credit card risk analysis: Identify characteristics of customers who are likely to default.
• Medical data: Identify relationships between symptoms, test results and illness.
Association analysis should not be confused with information retrieval. We are not, for example, searching for customers that bought ‘Beer’ with their groceries (this would be supervised learning).
Terminology
Itemset: A collection of items from I. If an itemset X has k items, it is called a k-itemset.
Support Count: The number of transactions that have X as a subset, σ(X) = |{tj : X ⊆ tj and tj ∈ T}|.
Support: The fraction of transactions that have X as a subset, s(X) = σ(X)/N.
Frequent Itemset: An itemset whose support is greater than or equal to a threshold minsup, where 0 < minsup < 1.
Support can be viewed as an estimate of the probability of observing X in a randomly selected transaction.
Notation: The empty set ∅ is the itemset that contains no items and it is a subset of every transaction.
Rule Evaluation Metrics
Association Rule: An association rule is an implication expression of the form X → Y such that X, Y ⊆ I and X ∩ Y = ∅.
Support: The fraction of transactions containing both X and Y as a subset, s(X → Y) = σ(X ∪ Y)/N.
Confidence: How frequently items in Y appear in transactions that also contain X, c(X → Y) = σ(X ∪ Y)/σ(X).
For the rule X æ Y , X is called the antecedent (or LHS) and Y is called the consequent (or RHS).
Association rules should not be confused with implications. Here '→' means co-occurrence and not causality. That is, transactions containing X also tend to contain Y.
The strength of a rule can be measured using support and confidence, where
s(X → Y) ≈ Pr(X ∪ Y) and c(X → Y) ≈ Pr(Y | X).
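These two metrics amount to simple counting. As a minimal sketch (not course code; the function names are ours), the definitions translate directly into Python:

```python
def support_count(itemset, transactions):
    """sigma(X): the number of transactions that contain X as a subset."""
    x = set(itemset)
    return sum(1 for t in transactions if x <= set(t))

def support(itemset, transactions):
    """s(X) = sigma(X) / N."""
    return support_count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    x, y = set(antecedent), set(consequent)
    return support_count(x | y, transactions) / support_count(x, transactions)
```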
Association Analysis
Example: Consider the following transactions
ID  Items
1   {Bread, Milk}
2   {Bread, Diapers, Beer, Eggs}
3   {Milk, Diapers, Beer, Cola}
4   {Bread, Milk, Diapers, Beer}
5   {Bread, Milk, Diapers, Cola}
(a) Find the support of X = {Bread, Milk, Diapers}.
(b) Suppose minsup = 4/5. Find all frequent itemsets.
(c) Find the support and confidence of the rule {Bread, Diapers} → {Beer}.
(a) s(X) = 2/5
(b) {Bread}, {Milk}, {Diapers}
(c) s({Bread, Diapers} → {Beer}) = σ({Bread, Diapers, Beer})/N = 2/5
c({Bread, Diapers} → {Beer}) = σ({Bread, Diapers, Beer})/σ({Bread, Diapers}) = 2/3
“When Bread and Diapers were purchased together, about 67% of the time Beer was also purchased.”
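The worked answers can be checked in a few lines of Python (a throwaway script, not course code; the transaction list is the table above):

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Cola"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Cola"},
]
N = len(transactions)

def sigma(itemset):
    """Support count: number of transactions containing the itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

print(sigma({"Bread", "Milk", "Diapers"}) / N)                            # (a) 0.4 = 2/5
print(sigma({"Bread", "Diapers", "Beer"}) / N)                            # (c) support 0.4 = 2/5
print(sigma({"Bread", "Diapers", "Beer"}) / sigma({"Bread", "Diapers"}))  # (c) confidence 0.666...
```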
Association Rule Mining
The goal of association rule mining is to find strong rules:
s(X → Y) ≥ minsup and c(X → Y) ≥ minconf.
A brute-force approach to finding strong rules is computationally prohibitive because there are 3^p - 2^(p+1) + 1 possible rules.
In the previous example there are p = 6 items, giving 3^6 - 2^7 + 1 = 602 association rules.
It is not uncommon to have p > 10^4 and so the number of strong rules can be extremely large. (We will talk about choosing minsup and minconf when we discuss the Apriori algorithm below.)
Association Rule Mining
One way to reduce computational effort is to decouple the support and confidence conditions.
Example: The following association rules have the same support.
{Bread, Diapers} → {Beer}, {Bread, Beer} → {Diapers}, {Diapers, Beer} → {Bread},
{Bread} → {Diapers, Beer}, {Beer} → {Bread, Diapers}, {Diapers} → {Bread, Beer}.
Association rule mining can be viewed as a two-step process:
1. Frequent itemset generation;
2. Association rule generation.
These rules are binary partitions of the set {Bread, Diapers, Beer}, and therefore they all have the same support. If one of these rules is known to fail the support threshold, then they all do (their confidence values will not necessarily be the same).
An efficient way to find strong rules would be to find all the frequent itemsets. Then, for each frequent set X, compute the confidences of rules formed from binary partitions of X.
Frequent Itemset Generation: Itemset Lattice
[Itemset lattice for I = {a, b, c, d, e}: levels run from the empty set (null) through the 1-, 2-, 3- and 4-itemsets down to the full set {a,b,c,d,e}.]
This is an itemset lattice for I = {a, b, c, d, e}, which is useful for illustrating the basic concepts of association analysis.
Frequent Itemset Generation
There are 2^p - 1 candidate itemsets to consider and hence it is computationally prohibitive to test all of them.
The number of candidate itemsets can be dramatically reduced using the Apriori principle.
Apriori Principle: If an itemset is frequent, then all of its subsets must also be frequent.
Corollary: If an itemset is infrequent, then all of its supersets must also be infrequent.
Apriori Principle: Let {a,b,c} be frequent
[Itemset lattice for I = {a, b, c, d, e}; the power set of {a,b,c} is marked as frequent.]
If {a,b,c} is frequent, then
∅, {a}, {b}, {c}, {a,b}, {a,c}, {b,c}, {a,b,c}
are also frequent. This is called the power set of {a,b,c}.
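A quick way to enumerate this power set (a small itertools sketch; the name power_set is ours):

```python
from itertools import chain, combinations

def power_set(itemset):
    """All subsets of an itemset, from the empty set up to the full set."""
    s = sorted(itemset)
    return list(chain.from_iterable(combinations(s, k) for k in range(len(s) + 1)))

print(power_set({"a", "b", "c"}))
# [(), ('a',), ('b',), ('c',), ('a', 'b'), ('a', 'c'), ('b', 'c'), ('a', 'b', 'c')]
```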
Apriori Principle: Let {a,e} be infrequent
[Itemset lattice for I = {a, b, c, d, e}; every superset of {a,e} is marked as infrequent.]
Support-based pruning: If X is infrequent, then all the supersets of X are infrequent. The supersets of {a,e} (infrequent) are
{a,e}, {a,b,e}, {a,c,e}, {a,d,e}, {a,b,c,e}, {a,b,d,e}, {a,c,d,e}, {a,b,c,d,e}.
Because {a,e} is infrequent and it is a subset of each of these itemsets, they are all infrequent (we do not need to compute their support to establish this).
Apriori Algorithm (1995)
1. Initialize: Set k = 1 and find all frequent items (1-itemsets).
2. Repeat until no new frequent itemsets are found:
(a) Generate candidate (k + 1)-itemsets from frequent k-itemsets.
(b) Eliminate candidate itemsets containing infrequent k-itemsets.
(c) Calculate the support of each candidate.
(d) Eliminate infrequent (k + 1)-itemsets and increment k.
The Apriori algorithm only requires one pass over the data for each value of k. This is important because we assume that the set of transactions cannot fit in our computer’s main memory.
The output of the entire analysis is a collection of strong rules. These are usually stored in a database that can be queried by the user, for example to find strong rules with particular items in the consequent.
The support threshold (minsup) should not be set too small because the number of passes over the data can grow exponentially with decreasing minsup. One strategy is to consider a sequence of decreasing values and stop when the output or computational burden becomes too large. (You will try different minsup and minconf values in the lab.)
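To make the steps above concrete, here is a minimal in-memory Python sketch (illustrative only: real implementations stream transactions from disk and use specialised counting structures such as hash trees, and all names here are ours, not from the course):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Sketch of the Apriori loop: returns {itemset: support count} for all
    frequent itemsets found with the given minsup threshold."""
    transactions = [frozenset(t) for t in transactions]
    N = len(transactions)

    def counts(candidates):
        return {c: sum(1 for t in transactions if c <= t) for c in candidates}

    # Initialize: frequent 1-itemsets
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {c: n for c, n in counts(items).items() if n / N >= minsup}
    result, k = dict(frequent), 1

    while frequent:
        # (a) candidate (k+1)-itemsets from unions of frequent k-itemsets
        keys = list(frequent)
        candidates = {a | b for a in keys for b in keys if len(a | b) == k + 1}
        # (b) prune candidates that contain an infrequent k-itemset
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k))}
        # (c) count support and (d) keep the frequent (k+1)-itemsets
        frequent = {c: n for c, n in counts(candidates).items() if n / N >= minsup}
        result.update(frequent)
        k += 1

    return result
```

Run on the five grocery transactions from the earlier example with minsup = 3/5, this sketch should return the eight frequent itemsets implied by the candidate tables shown below (four 1-itemsets and four 2-itemsets).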
Apriori Algorithm: Step 2(a)
We need a complete (guarantee every frequent itemset is generated) and efficient (no duplicates and few infrequent candidate itemsets) way of generating candidate itemsets.
F_{k-1} × F_{k-1} Method:
1. Place the frequent k-itemsets in lexicographic order (xi ≤ xi+1):
X = (x1, x2, …, xk),  Y = (y1, y2, …, yk).
2. Two itemsets X and Y are merged to form a candidate (k + 1)-itemset iff xi = yi for i = 1, 2, …, k - 1, and xk ≠ yk.
Lexicographic order = dictionary order
F_{k-1} × F_{k-1} Method: Example
Example: Let I = {a, b, c, d, e} and suppose that there are five frequent 3-itemsets:
{b,a,c}, {a,b,d}, {a,c,d}, {a,c,e}, {b,c,d}.
(a) Use the F_{k-1} × F_{k-1} method to generate all possible candidate 4-itemsets.
(b) Are any of the candidate 4-itemsets infrequent?
(a) Before we merge, the itemsets must be placed in lexicographic order. Hence, the first itemset must be reordered as {a,b,c}. There are two possible merges:
{a,b,c}, {a,b,d} → {a,b,c,d}
{a,c,d}, {a,c,e} → {a,c,d,e}
(b) {a,c,d,e} must be infrequent because {a,d,e} was not listed as a frequent 3-itemset.
We would need to scan the set of transactions to determine if {a,b,c,d} is frequent.
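The merge rule can be written down almost verbatim (a sketch; merge_candidates is an illustrative name, and itemsets are represented as sorted tuples):

```python
from itertools import combinations

def merge_candidates(frequent_k):
    """F_{k-1} x F_{k-1} generation: merge two k-itemsets that agree on their
    first k-1 items (in lexicographic order) and differ in their last item."""
    ordered = sorted(tuple(sorted(s)) for s in frequent_k)
    candidates = set()
    for x, y in combinations(ordered, 2):
        if x[:-1] == y[:-1] and x[-1] != y[-1]:
            candidates.add(tuple(sorted(set(x) | set(y))))
    return candidates

frequent_3 = [{"b", "a", "c"}, {"a", "b", "d"}, {"a", "c", "d"}, {"a", "c", "e"}, {"b", "c", "d"}]
print(merge_candidates(frequent_3))
# {('a', 'b', 'c', 'd'), ('a', 'c', 'd', 'e')}
```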
Apriori Algorithm: Example
An illustration of the Apriori algorithm using the five grocery transactions from the earlier example with minsup = 3/5.
Candidate 1-itemsets:
Beer 3
Bread 4
Cola 2
Diapers 4
Milk 4
Eggs 1

Candidate 2-itemsets:
{Beer, Bread} 2
{Beer, Diapers} 3
{Beer, Milk} 2
{Bread, Diapers} 3
{Bread, Milk} 3
{Diapers, Milk} 3

Candidate 3-itemsets:
{Bread, Diapers, Milk} 2
The brute force approach generates 41 candidate itemsets, whereas the Apriori algorithm generates 13.
There are many tricks to speed up the algorithm (for example using hash trees for support counting), but these are not covered here because they do not change the basic structure of the algorithm.
The Apriori algorithm represents one of the major advances in data mining technology (The Elements of Statistical Learning, Hastie et al.).
Maximal Frequent Itemsets
The number of frequent itemsets found by the Apriori algorithm is usually huge, especially when minsup is low.
Example: if a frequent 100-itemset is found, there are ≈ 1.27 × 10^30 frequent sub-itemsets!
Maximal Frequent Itemset: a frequent itemset for which none of its proper (immediate) supersets is frequent.
Maximal frequent itemsets provide a compact representation of frequent itemsets, but they contain no information about the support counts of their power sets (all frequent itemsets).
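Given the full collection of frequent itemsets (for example, the keys of the dictionary returned by an Apriori-style routine), the maximal ones can be picked out with a direct superset check. A sketch, quadratic in the number of itemsets and intended only for illustration:

```python
def maximal_frequent(frequent_itemsets):
    """Keep only the frequent itemsets that have no frequent proper superset."""
    frequent = [frozenset(f) for f in frequent_itemsets]
    return [x for x in frequent if not any(x < y for y in frequent)]  # '<' = proper subset
```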
Maximal Frequent Itemsets
[Itemset lattice for I = {a, b, c, d, e} with infrequent itemsets crossed out in red; the maximal frequent itemsets are {a,d}, {a,c,e} and {b,c,d,e}.]
Let the red crosses denote infrequent itemsets. Then the un-crossed itemsets are frequent, and {a,d}, {a,c,e} and {b,c,d,e} are maximal frequent.
• All proper supersets of {a,d}, {a,c,e}, {b,c,d,e} are infrequent.
• The power sets of {a,d}, {a,c,e}, {b,c,d,e} are frequent.
• {a,c} is not maximal frequent because {a,c,e} is frequent and {a,c} ⊂ {a,c,e}.
Closed Frequent Itemsets
Closed Itemset: an itemset X is closed if none of its proper (immediate) supersets has exactly the same support count as X.
Closed Frequent Itemsets: a closed itemset with support greater than or equal to minsup.
Example: Using minsup = 2/5, find the closed frequent itemsets in:
ID  Items
1   {a,b,c}
2   {a,b,c,d}
3   {b,c,e}
4   {a,c,d,e}
5   {d,e}
Closed itemsets put another way:
An itemset is not closed if at least one of its proper supersets has the same support.
Closed Frequent Itemsets
[Itemset lattice for I = {a, b, c, d, e} annotated with support counts:
1-itemsets: a 3, b 3, c 4, d 3, e 3
2-itemsets: ab 2, ac 3, ad 2, ae 1, bc 3, bd 1, be 1, cd 2, ce 2, de 2
3-itemsets: abc 2, abd 1, abe 0, acd 2, ace 1, ade 1, bcd 1, bce 1, bde 0, cde 1
4-itemsets: abcd 1, abce 0, abde 0, acde 1, bcde 0
5-itemset: abcde 0]
The closed frequent itemsets are
{c}, {d}, {e}, {a,c}, {b,c}, {c,e}, {d,e}, {a,b,c}, {a,c,d}.
All the subsets of closed frequent itemsets are frequent.
There are more closed frequent itemsets than maximal frequent itemsets.
The advantage of using closed frequent itemsets over maximal frequent itemsets is that the support of non-closed frequent itemsets can be obtained from closed frequent itemsets. No additional scans of the database are required to calculate their support. The support of a non-closed frequent itemset X is equal to the largest support among the closed proper supersets of X.
For example, the support of {a,d} (non-closed frequent) is σ({a,d}) = σ({a,c,d}) = 2.
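This lookup is simple to express in code (a sketch; the dictionary closed is assumed to map each closed frequent itemset to its support count, here using the values from the example above):

```python
def support_from_closed(itemset, closed):
    """Support count of a frequent itemset, given only the closed frequent
    itemsets: the largest support count among its closed supersets."""
    x = frozenset(itemset)
    return max(count for c, count in closed.items() if x <= c)

# Closed frequent itemsets and their support counts from the example above
closed = {
    frozenset({"c"}): 4, frozenset({"d"}): 3, frozenset({"e"}): 3,
    frozenset({"a", "c"}): 3, frozenset({"b", "c"}): 3,
    frozenset({"c", "e"}): 2, frozenset({"d", "e"}): 2,
    frozenset({"a", "b", "c"}): 2, frozenset({"a", "c", "d"}): 2,
}
print(support_from_closed({"a", "d"}, closed))  # 2, inherited from {a,c,d}
```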
Rule Generation
Each frequent k-itemset Y has 2^k association rules of the form X → Y \ X, where X ⊆ Y.
Example: Let Y = {a,b,e} be a frequent itemset; then the possible association rules are:
∅ → {a,b,e}, {a,b} → {e}, {a,e} → {b}, {b,e} → {a},
{a} → {b,e}, {b} → {a,e}, {e} → {a,b}, {a,b,e} → ∅.
The goal is to find strong rules:
c(X → Y \ X) = s(Y)/s(X) ≥ minconf.
The rules ∅ → Y and Y → ∅ are not interesting and hence we will discard them.
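Enumerating the remaining 2^k - 2 rules is a loop over the non-empty proper subsets of Y. A sketch with illustrative names:

```python
from itertools import chain, combinations

def rules_from_itemset(itemset):
    """All 2^k - 2 non-trivial rules X -> Y\\X from a frequent k-itemset Y."""
    y = frozenset(itemset)
    subsets = chain.from_iterable(combinations(sorted(y), k) for k in range(1, len(y)))
    return [(frozenset(x), y - frozenset(x)) for x in subsets]

for lhs, rhs in rules_from_itemset({"a", "b", "e"}):
    print(sorted(lhs), "->", sorted(rhs))
# ['a'] -> ['b', 'e'], ['b'] -> ['a', 'e'], ..., ['b', 'e'] -> ['a']   (6 rules in total)
```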
Confidence Based Pruning
It is not necessary to calculate the confidence of every possible rule from a frequent itemset.
Let Y be a frequent itemset with X ⊂ Y. If the rule X → Y \ X fails the minconf threshold, then any rule X′ → Y \ X′ with X′ ⊂ X also fails the threshold.
[Rule lattice for the frequent itemset {a,b,c,d}: the fourteen non-trivial rules, from the 1-item-consequent rules {b,c,d} → {a}, {a,c,d} → {b}, {a,b,d} → {c}, {a,b,c} → {d} at the top, down to the 3-item-consequent rules {a} → {b,c,d}, {b} → {a,c,d}, {c} → {a,b,d}, {d} → {a,b,c} at the bottom.]
, University of Canterbury 2021
STAT318 — Data Mining ,22 / 28
If {a,c,d} æ {b} is a low confidence rule, then all rules with ‘b’ in the consequent (RHS) can be pruned (they must be low confidence rules as well).
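One way to code this pruning (a sketch under the assumption that support is a dictionary of support values keyed by frozenset, e.g. from the frequent-itemset step; it mirrors the idea of growing consequents level by level rather than any particular implementation):

```python
from itertools import combinations

def confident_rules(Y, support, minconf):
    """Rules X -> H (H = Y\\X) from one frequent itemset Y, generated by
    growing the consequent H level by level; a larger consequent is only
    generated when all of its smaller consequents passed minconf."""
    Y = frozenset(Y)
    rules = []
    level = [frozenset([i]) for i in Y]            # 1-item consequents
    while level and len(level[0]) < len(Y):
        passed = []
        for H in level:
            conf = support[Y] / support[Y - H]     # c(Y\H -> H) = s(Y)/s(Y\H)
            if conf >= minconf:
                rules.append((Y - H, H, conf))
                passed.append(H)
        # next level: merge passing consequents and keep only those whose
        # smaller consequents all passed (the pruning step)
        m = len(level[0])
        merged = {a | b for a, b in combinations(passed, 2) if len(a | b) == m + 1}
        level = [c for c in merged
                 if all(frozenset(s) in passed for s in combinations(c, m))]
    return rules
```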
Strong rules are not necessarily interesting
Example: Consider the following association rule and contingency table:
{Tea} → {Coffee}   (support = 15%, confidence = 75%).

              Coffee   Not Coffee   Total
Tea            150         50        200
Not Tea        650        150        800
Total          800        200       1000

(a) Pr(Coffee) = ?
(b) Pr(Coffee | Tea) = ?
(a) Pr(Coffee) = 800/1000 = 0.8
(b) Pr(Coffee | Tea) = 150/200 = 0.75
The rule says 'those that purchase Tea also tend to purchase Coffee.' However, comparing (a) and (b), we see that purchasing Tea actually decreases the chance of purchasing Coffee.
This rule has high support and confidence, but it is misleading!
Objective Interestingness Measures
Lift is an interestingness measure based on correlation.
Lift: Given two itemsets X and Y, the lift measure is defined as
lift(X, Y) = s(X ∪ Y) / (s(X)s(Y)) = c(X → Y) / s(Y).
Interpretation:
lift(X, Y) = 1: X and Y are statistically independent;
lift(X, Y) > 1: X and Y are positively correlated;
lift(X, Y) < 1: X and Y are negatively correlated.
Example: Find lift(Tea, Coffee).
If lift(X, Y) < 1, the occurrence of X inhibits the occurrence of Y. If lift(X, Y) > 1, the occurrence of one promotes the occurrence of the other.
lift(Tea, Coffee) = s(Tea ∪ Coffee) / (s(Tea)s(Coffee)) = 1000(150) / (200(800)) = 0.9375 < 1.
The occurrence of Tea reduces the likelihood of Coffee (a negative correlation).
Subjective Interestingness Measures
• Unexpectedness (the rule contradicts domain knowledge)
• Actionable (can do something useful with the rule)
Objective Interestingness Measures
Cosine: Given two itemsets X and Y, the cosine measure is defined by
cosine(X, Y) = s(X ∪ Y) / √(s(X)s(Y)) = √(c(X → Y) c(Y → X)).
A cosine value close to one indicates that most of the transactions containing X also contain Y , and vice versa. A value close to zero means most transactions containing X do not contain Y , and vice versa.
cosine(Tea, Coffee) = 150/√(200(800)) = 0.375.
The cosine measure will be low when one of the rules X → Y or Y → X has low confidence. In our example,
c({Coffee} → {Tea}) = 150/800 ≈ 19%.
That is, transactions containing Coffee tend not to contain Tea.
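Both numbers follow directly from the Tea/Coffee contingency table (a quick check in Python, not course code):

```python
from math import sqrt

N = 1000
sigma_tea, sigma_coffee, sigma_both = 200, 800, 150   # counts from the contingency table

s_tea, s_coffee, s_both = sigma_tea / N, sigma_coffee / N, sigma_both / N

lift = s_both / (s_tea * s_coffee)        # 0.9375  (< 1: negative correlation)
cosine = s_both / sqrt(s_tea * s_coffee)  # 0.375
print(lift, cosine)
```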
Objective Interestingness Measures
Symmetric Measure: a measure M is symmetric iff M(X → Y) = M(Y → X).
Asymmetric Measure: a measure that is not symmetric.
Null Transaction: a transaction that does not include any of the itemsets being
examined.
Null Invariant: a measure is null invariant if its value does not depend on null transactions.
• Symmetric measures are usually useful for evaluating itemsets and asymmetric measures are useful for evaluating association rules.
• Null invariant measures are useful because they can find interesting associations in data containing many null transactions. If a measure is not null invariant, interesting associations can be lost by simply adding null transactions to the database.
Objective Interestingness Measures
Example: Is lift a symmetric measure? Is it null invariant?
Example: Is cosine a symmetric measure? Is it null invariant?
Lift is symmetric,
lift(X, Y) = s(X ∪ Y) / (s(X)s(Y)) = lift(Y, X),
but it is not null invariant,
lift(X, Y) = N σ(X ∪ Y) / (σ(X)σ(Y)).
Increasing the number of null transactions increases N, but the support counts remain unchanged. If we add many null transactions, lift will become large and suggest that there is a positive correlation between X and Y!
Cosine is symmetric,
cosine(X, Y) = s(X ∪ Y) / √(s(X)s(Y)) = cosine(Y, X),
and it is null invariant (it does not depend on N). Increasing the number of null transactions will not change the cosine value.
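This difference is easy to see numerically: hold the Tea/Coffee support counts fixed and add null transactions (a small check, not course code):

```python
from math import sqrt

sigma_x, sigma_y, sigma_xy = 200, 800, 150    # Tea, Coffee, Tea-and-Coffee counts

for N in (1000, 10_000, 100_000):             # progressively more null transactions
    lift = N * sigma_xy / (sigma_x * sigma_y)
    cosine = sigma_xy / sqrt(sigma_x * sigma_y)
    print(N, round(lift, 3), round(cosine, 3))
# lift: 0.938, 9.375, 93.75   cosine: 0.375 every time
```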
Summary
Association rule mining consists of finding frequent itemsets, from which strong association rules are formed.
The Apriori algorithm is a seminal method for association analysis.
Association rules are among data mining’s biggest successes (Hastie, Tibshirani, Friedman).
One of the limitations of the Apriori algorithm is that high confidence rules with low support will be missed because minsup is usually too large. For example,
{Vodka} → {Caviar}
will have very low support, but it could have high confidence. Decreasing minsup to find these rules affects the computational feasibility of the algorithm.