
CS699
Lecture 8 Association Rule Mining

What Is Frequent Pattern Analysis?
 Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
 First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
 Motivation: Finding inherent regularities in data
 What products were often purchased together? Beer and diapers?!
 What are the subsequent purchases after buying a PC?
 What kinds of DNA are sensitive to this new drug?
 Can we automatically classify web documents?
 Applications
 Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis.

Why Is Freq. Pattern Mining Important?
 Freq. pattern: An intrinsic and important property of datasets
 Foundation for many essential data mining tasks
 Association, correlation, and causality analysis
 Sequential, structural (e.g., sub-graph) patterns
 Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
 Classification: discriminative, frequent pattern analysis
 Cluster analysis: frequent pattern-based clustering
 Data warehousing: iceberg cube and cube-gradient
 Broad applications

Association Rule Mining
• Two-step process:
• First, mine all frequent patterns. (A frequent pattern is also called a frequent itemset or a large itemset.)
• Second, mine strong rules from the frequent itemsets.

Basic Concepts: Frequent Itemsets
• itemset: a set of one or more items
• k-itemset: a set of k items
• 1-itemset: {beer}, {nuts}, {diaper}, {coffee}, …
• 2-itemset: {beer, nuts}, {beer, coffee}, {eggs, milk}, …
• 3-itemset: {beer, nuts, diaper}, {nuts, coffee, eggs}, …

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Beer, Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

Basic Concepts: Frequent Itemsets
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Beer, Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 (absolute) support, or support count, of X: the frequency, or number of transactions, that contain itemset X
 (relative) support, s, is the fraction of transactions that contain X (i.e., the probability that a transaction contains X)
 When we say "support" it could mean either, so interpret it in context.
 Support of {beer}: 4 (count), 0.8 (4 out of 5), or 80%
 Support of {coffee, diaper}: 2, 0.4 (2 out of 5), or 40%
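A minimal sketch of both flavors of support, assuming transactions are represented as Python sets (the code and helper names are ours, not the slides'):

transactions = [
    {"beer", "nuts", "diaper"},
    {"beer", "coffee", "diaper"},
    {"beer", "diaper", "eggs"},
    {"beer", "nuts", "eggs", "milk"},
    {"nuts", "coffee", "diaper", "eggs", "milk"},
]

def support_count(itemset, transactions):
    """Absolute support: the number of transactions that contain the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Relative support: the fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)

print(support_count({"beer"}, transactions))        # 4
print(support({"beer"}, transactions))              # 0.8
print(support({"coffee", "diaper"}, transactions))  # 0.4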

Basic Concepts: Frequent Itemsets
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Beer, Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 An itemset X is frequent if X's support is no less than a predefined minimum support threshold, minsup.
 If minsup = 0.6 or 60%:
 {beer} is frequent; equivalently, it is a frequent itemset, a large itemset, or a frequent pattern.
 {coffee, diaper} is not frequent, i.e., it is not a frequent itemset/pattern.
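Continuing the sketch above, the frequency test is a single comparison against minsup (support() and transactions are from the previous sketch):

minsup = 0.6
for X in [{"beer"}, {"coffee", "diaper"}]:
    verdict = "frequent" if support(X, transactions) >= minsup else "not frequent"
    print(sorted(X), "is", verdict)
# ['beer'] is frequent; ['coffee', 'diaper'] is not frequent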

Basic Concepts: Association Rules
 A rule R1 = X → Y
 Support of R1:
s(R1) = support(X ∪ Y),
or the probability that a transaction contains X ∪ Y
 Confidence of R1:
c(R1) = support(X ∪ Y) / support(X),
or the conditional probability that a transaction having X also contains Y
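These two formulas translate directly into code. A sketch reusing the support helpers from the earlier example (the function names are ours); it reproduces the numbers on the next slide:

def rule_support(X, Y, transactions):
    # s(X -> Y) = support(X union Y)
    return support(X | Y, transactions)

def rule_confidence(X, Y, transactions):
    # c(X -> Y) = support(X union Y) / support(X)
    return support_count(X | Y, transactions) / support_count(X, transactions)

print(rule_support({"diaper"}, {"beer"}, transactions))     # 0.6
print(rule_confidence({"diaper"}, {"beer"}, transactions))  # 0.75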

Basic Concepts: Association Rules

Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Beer, Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

[Venn diagram: customers who buy beer, customers who buy diaper, and customers who buy both.]

 R1 = {diaper} → {beer}
s(R1) = support({diaper, beer}) = 3 (count), 0.6 or 60%
3 transactions (or 60% of all transactions) contain both diaper and beer.
c(R1) = support({diaper, beer}) / support({diaper}) = 3 / 4 = 0.75 or 75%
Among those who purchased diaper, 75% of them also purchased beer.

Basic Concepts: Association Rules
Tid | Items bought
10  | Beer, Nuts, Diaper
20  | Beer, Coffee, Diaper
30  | Beer, Diaper, Eggs
40  | Beer, Nuts, Eggs, Milk
50  | Nuts, Coffee, Diaper, Eggs, Milk

 Find all rules X → Y with support ≥ minimum support and confidence ≥ minimum confidence. They are called strong rules.
 Let minsup = 40%, minconf = 60%.
 Then (see the sketch below):
Beer → Eggs (support = 40%, confidence = 50%) is not a strong rule.
Eggs → Beer (support = 40%, confidence = 67%) is a strong rule.
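A sketch of the strong-rule test, reusing the helpers defined earlier (names are ours); it reproduces the two verdicts above:

minsup, minconf = 0.40, 0.60
for X, Y in [({"beer"}, {"eggs"}), ({"eggs"}, {"beer"})]:
    s = rule_support(X, Y, transactions)
    c = rule_confidence(X, Y, transactions)
    print(sorted(X), "->", sorted(Y),
          f"s = {s:.0%}, c = {c:.0%}, strong = {s >= minsup and c >= minconf}")
# ['beer'] -> ['eggs'] s = 40%, c = 50%, strong = False
# ['eggs'] -> ['beer'] s = 40%, c = 67%, strong = True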

The Downward Closure Property and Scalable Mining Methods
 Apriori property of frequent itemsets
 Any nonempty subset of a frequent itemset must be frequent.
 If {beer, diaper, nuts} is frequent, so is {beer, diaper}, because every transaction having {beer, diaper, nuts} also contains {beer, diaper}.
 It can also be stated as:
 If an itemset contains a subset that is not frequent, then the itemset can never be frequent.
 Consider an itemset X = {milk, cheese, egg}.
 If {milk, egg} is not frequent, then X can never be frequent.
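The property gives Apriori its pruning test: a candidate can be discarded as soon as one of its (k-1)-subsets is missing from the frequent (k-1)-itemsets. A sketch with an assumed helper name (not from the slides):

from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if some (k-1)-subset of candidate is not frequent,
    in which case the candidate itself can never be frequent."""
    k = len(candidate)
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(sorted(candidate), k - 1))

# {milk, egg} is absent from the frequent 2-itemsets, so X is pruned.
L2 = {frozenset({"milk", "cheese"}), frozenset({"cheese", "egg"})}
print(has_infrequent_subset({"milk", "cheese", "egg"}, L2))  # True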

The Downward Closure Property and Scalable Mining Methods
 Scalable mining methods: Three major approaches
 Apriori (Agrawal & Srikant@VLDB’94)
 Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
 Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)

Apriori: A Candidate Generation & Test Approach
 Pruning using Apriori property: If an itemset has a subset which is infrequent, then the itemset should not be tested!
(“test” means: to determine whether an itemset is frequent or not)
 Algorithm (simplified):
 Initially, scan DB once to get frequent 1-itemset
 Generate length (k+1) candidate itemsets from length k frequent itemsets
 Prune candidate itemsets using Apriori property
 Test the candidates against DB
 Terminate when no frequent or candidate set can be generated
 For simplicity, we will not discuss how to prune candidate itemsets.

The Apriori Algorithm – An Example
Database TDB (Supmin = 2):
Tid | Items
10  | A, C, D
20  | B, C, E
30  | A, B, C, E
40  | B, E

C1 (after 1st scan):
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2 (generated from L1, before counting):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

C2 (after 2nd scan):
Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2:
Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3 (generated from L2): {B, C, E}

L3 (after 3rd scan):
Itemset | sup
{B, C, E} | 2

The Apriori Algorithm (outline)
1. Scan DB and find candidate 1-itemsets → C1
2. Mine frequent 1-itemsets from C1 → L1
3. Generate candidate 2-itemsets from L1 → C2 (w/o count)
4. Scan DB and count → C2 (with count)
5. Mine frequent 2-itemsets from C2 → L2
6. k=3
7. Generate candidate k-itemsets from Lk-1 → Ck (w/o count)
8. Scan DB and count → Ck (with count)
9. Mine frequent k-itemsets from Ck → Lk
10. k = k + 1, and go to Step 7
(Stop when Ck is empty or Lk is empty.)
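The outline maps to a short program. Below is a sketch of our own (names and structure are assumptions, not the textbook's code), run on the TDB example from the previous slide with Supmin = 2:

from itertools import combinations

def apriori(transactions, minsup_count):
    items = sorted({i for t in transactions for i in t})
    # Steps 1-2: count single items and keep the frequent 1-itemsets (L1)
    L = [{frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= minsup_count}]
    while L[-1]:
        # Step 7: join Lk-1 with itself; two itemsets join only if
        # all but their last item are identical
        prev = sorted(tuple(sorted(s)) for s in L[-1])
        C = set()
        for a, b in combinations(prev, 2):
            if a[:-1] == b[:-1]:
                cand = frozenset(a) | frozenset(b)
                # Apriori pruning: every (k-1)-subset must be frequent
                if all(frozenset(s) in L[-1]
                       for s in combinations(sorted(cand), len(cand) - 1)):
                    C.add(cand)
        # Steps 8-9: scan the DB, count the candidates, keep the frequent ones
        L.append({c for c in C
                  if sum(1 for t in transactions if c <= t) >= minsup_count})
    return [lk for lk in L if lk]  # the last level is empty; drop it

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for k, lk in enumerate(apriori(tdb, 2), 1):
    print(f"L{k}:", sorted(sorted(s) for s in lk))
# L1: [['A'], ['B'], ['C'], ['E']]
# L2: [['A', 'C'], ['B', 'C'], ['B', 'E'], ['C', 'E']]
# L3: [['B', 'C', 'E']]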

Implementation of Apriori
 How to generate Ck+1 from Lk
 Join two k-itemsets to generate (k+1)-itemsets.
 When joining two k-itemsets, we join only if the first k-1 items are identical.
 Example: Generate C4 from L3.
 L3 = {abc, abd, acd, ace, bcd}
 Generate abcd from abc and abd.
 Generate acde from acd and ace.
(When joining two 3-itemsets, we join only if the first 2 items are identical.)
 acde is then removed by pruning, because its subset cde is not in L3. So:
 C4 = {abcd}
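A sketch of just this join step (the helper name is ours), reproducing the example:

from itertools import combinations

def apriori_join(Lk):
    """Join two sorted k-itemsets only when their first k-1 items agree."""
    joined = set()
    for a, b in combinations(sorted(Lk), 2):
        if a[:-1] == b[:-1]:
            joined.add(tuple(sorted(set(a) | set(b))))
    return joined

L3 = {("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")}
print(sorted(apriori_join(L3)))
# [('a','b','c','d'), ('a','c','d','e')]
# acde is later removed by pruning (its subset cde is not in L3), so C4 = {abcd}.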

Another Example
• Items are sorted.
• Pruning is not shown in the subsequent steps.

Dataset (min. support = 30%, or three transactions):
Customer | Items
C1 | beer, bread, chip, egg
C2 | beer, bread, chip, egg, popcorn
C3 | bread, chip, egg
C4 | beer, bread, chip, egg, milk, popcorn
C5 | beer, bread, milk
C6 | beer, bread, egg
C7 | bread, chip, milk
C8 | bread, butter, chip, egg, milk
C9 | butter, chip, egg

C1: Candidate 1-itemsets
Itemset | Support count
{beer} | 5
{bread} | 8
{butter} | 2
{chip} | 7
{egg} | 7
{milk} | 4
{popcorn} | 2

L1: Frequent 1-itemsets
Itemset | Support count
{beer} | 5
{bread} | 8
{chip} | 7
{egg} | 7
{milk} | 4

C2: Candidate 2-itemsets
Itemset
{beer, bread}
{beer, chip}
{beer, egg}
{beer, milk}
{bread, chip}
{bread, egg}
{bread, milk}
{chip, egg}
{chip, milk}
{egg, milk}

C2: Candidate 2-itemsets with counts (scan the database and count the supports)
Itemset | Support count
{beer, bread} | 5
{beer, chip} | 3
{beer, egg} | 4
{beer, milk} | 2
{bread, chip} | 6
{bread, egg} | 6
{bread, milk} | 4
{chip, egg} | 6
{chip, milk} | 3
{egg, milk} | 2

L2: Frequent 2-itemsets
Itemset | Support count
{beer, bread} | 5
{beer, chip} | 3
{beer, egg} | 4
{bread, chip} | 6
{bread, egg} | 6
{bread, milk} | 4
{chip, egg} | 6
{chip, milk} | 3

C3: Candidate 3-itemsets
Itemset
{beer, bread, chip}
{beer, bread, egg}
{beer, chip, egg}
{bread, chip, egg}
{bread, chip, milk}
{bread, egg, milk}
{chip, egg, milk}

Two frequent 2-itemsets are joined only if their first items are identical.
We join {beer, bread} and {beer, chip} to generate {beer, bread, chip}.
But we do not join {beer, chip} and {chip, egg}.
Note: Pruning is done at this step, before counting, but it is omitted here.

C3: Candidate 3-itemsets with counts (scan the database and count the supports)
Itemset | Support count
{beer, bread, chip} | 3
{beer, bread, egg} | 4
{beer, chip, egg} | 3
{bread, chip, egg} | 5
{bread, chip, milk} | 3
{bread, egg, milk} | 2
{chip, egg, milk} | 2

L3: Frequent 3-itemsets
Itemset | Support count
{beer, bread, chip} | 3
{beer, bread, egg} | 4
{beer, chip, egg} | 3
{bread, chip, egg} | 5
{bread, chip, milk} | 3

C4: Candidate 4-itemsets
Itemset
{beer, bread, chip, egg}
{bread, chip, egg, milk}

Again, we join two frequent 3-itemsets only if the first two items are identical.
{beer, bread, chip} JOIN {beer, bread, egg} generates {beer, bread, chip, egg}.
{bread, chip, egg} JOIN {bread, chip, milk} generates {bread, chip, egg, milk}.

C4: Candidate 4-itemsets after pruning (scan the database and count the supports)
Itemset | Support count
{beer, bread, chip, egg} | 3

{bread, chip, egg, milk} was pruned because its subset {bread, egg, milk} is not frequent.

L4: Frequent 4-itemsets
Itemset | Support count
{beer, bread, chip, egg} | 3

All Frequent Itemsets
Frequent itemsets: L = L1 ∪ L2 ∪ L3 ∪ L4
L = {{beer}, {bread}, {chip}, {egg}, {milk},
{beer, bread}, {beer, chip}, {beer, egg}, {bread, chip}, {bread, egg}, {bread, milk}, {chip, egg}, {chip, milk},
{beer, bread, chip}, {beer, bread, egg}, {beer, chip, egg}, {bread, chip, egg}, {bread, chip, milk},
{beer, bread, chip, egg}}

Mining Strong Rules
• For each frequent itemset, identify all nonempty proper subsets.
• Example: from {beer, bread, egg}
• All nonempty proper subsets are:
{beer}, {bread}, {egg}, {beer, bread}, {beer, egg}, {bread, egg}
• For each subset, we form a rule (see the sketch after this list):
R1: {beer} → {bread, egg}
R2: {bread} → {beer, egg}
R3: {egg} → {beer, bread}
R4: {beer, bread} → {egg}
R5: {beer, egg} → {bread}
R6: {bread, egg} → {beer}
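A sketch (the helper name is ours) that enumerates the nonempty proper subsets of a frequent itemset and forms one candidate rule per subset:

from itertools import combinations

def candidate_rules(itemset):
    """Yield (antecedent, consequent) pairs, one per nonempty proper subset."""
    items = sorted(itemset)
    for r in range(1, len(items)):  # r < len(items) keeps the subset proper
        for antecedent in combinations(items, r):
            consequent = tuple(i for i in items if i not in antecedent)
            yield antecedent, consequent

for X, Y in candidate_rules({"beer", "bread", "egg"}):
    print(set(X), "->", set(Y))
# Prints the six rules R1-R6 above.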

Mining Strong Rules
• Compute the confidences:
confidence = sup(all items) / sup(antecedent)
conf(R1) = sup({beer, bread, egg}) / sup({beer}) = 4/5 = 80%
conf(R2) = sup({beer, bread, egg}) / sup({bread}) = 4/8 = 50%
conf(R3) = sup({beer, bread, egg}) / sup({egg}) = 4/7 = 57.1%
conf(R4) = sup({beer, bread, egg}) / sup({beer, bread}) = 4/5 = 80%
conf(R5) = sup({beer, bread, egg}) / sup({beer, egg}) = 4/4 = 100%
conf(R6) = sup({beer, bread, egg}) / sup({bread, egg}) = 4/6 = 66.7%

Mining Strong Rules
• Choose the rules whose confidences satisfy the minimum confidence.
• If min_conf = 80%, R1, R4, and R5 are strong rules.
• If min_conf = 60%, R1, R4, R5, and R6 are strong rules.
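Putting the last three slides together, a self-contained sketch (our own code, not the slides') that recomputes the six confidences from the nine-customer dataset and flags the strong rules. Support need not be rechecked: every rule formed from a frequent itemset already meets minsup.

from itertools import combinations

customers = [
    {"beer", "bread", "chip", "egg"},
    {"beer", "bread", "chip", "egg", "popcorn"},
    {"bread", "chip", "egg"},
    {"beer", "bread", "chip", "egg", "milk", "popcorn"},
    {"beer", "bread", "milk"},
    {"beer", "bread", "egg"},
    {"bread", "chip", "milk"},
    {"bread", "butter", "chip", "egg", "milk"},
    {"butter", "chip", "egg"},
]

def sup(itemset):
    """Support count: the number of baskets containing the itemset."""
    return sum(1 for t in customers if itemset <= t)

itemset, min_conf = {"beer", "bread", "egg"}, 0.80
items = sorted(itemset)
for r in range(1, len(items)):
    for X in map(set, combinations(items, r)):
        conf = sup(itemset) / sup(X)
        print(X, "->", itemset - X, f"conf = {conf:.1%}",
              "strong" if conf >= min_conf else "")
# R1 (80.0%), R4 (80.0%), and R5 (100.0%) are strong at 80%;
# lowering min_conf to 60% also admits R6 (66.7%).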

Exercise
• Mine all frequent itemsets from the following dataset. Assume that the minimum support is 30% (or 3 transactions).
• Then, mine all strong rules from the first frequent 3-itemset (when 3-itemsets are sorted by the items). Assume that the minimum confidence is 80%.

TID | Items
100 | 2, 4, 5, 6
200 | 1, 4, 5, 7
300 | 2, 4, 5
400 | 1, 2, 4, 5, 6, 7
500 | 1, 2, 6
600 | 1, 2, 5, 7
700 | 2, 4, 6
800 | 2, 3, 4, 5, 6
900 | 3, 4, 5, 6

References
• Han, J., Kamber, M., Pei, J., "Data Mining: Concepts and Techniques," 3rd ed., Morgan Kaufmann, 2012
• http://www.cs.illinois.edu/~hanj/bk3/