
Pattern Analysis & Machine Intelligence Research Group

ECE 657A: Lecture 8 – Association Rule Mining
Mark Crowley

Association Rule Mining

Material in this section is based on the following references:

1. Margaret Dunham, Data Mining Introductory and Advanced Topics, Prentice Hall, 2003.
2. Jiawei Han, Micheline Kamber and Jian Pei, Data Mining: Concepts and Techniques, 3rd ed., Morgan Kaufmann Publishers, May 2011.
3. P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Addison Wesley, 2006.
4. Rakesh Agrawal and Ramakrishnan Srikant, Fast algorithms for mining association rules in large databases, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB), pages 487-499, 1994.

Example: Market Basket Transactions

Assume we have the following transaction data set with 5 items (shirt, trouser, tie, socks, belt).
Possible itemsets: 2^5 − 1 = 31

Transaction   Items
t1            shirt, trouser, tie
t2            shirt, tie
t3            trouser, socks, tie
t4            trouser, belt
t5            shirt, trouser
t6            shirt, socks
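The count of possible itemsets (2^5 − 1 = 31) can be checked by enumerating every non-empty subset of the 5 items; a minimal sketch in Python (item names taken from the table above):

```python
from itertools import combinations

items = ["shirt", "trouser", "tie", "socks", "belt"]

# Every non-empty subset of the 5 items is a possible itemset.
itemsets = [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]

print(len(itemsets))  # 2^5 - 1 = 31
```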

Association Rules

• Information to be gathered from datasets and databases
• Finding relationships between items in a dataset
• Finding associations between items that typically occur together

Applications of Using These Rules

For transaction datasets, the rules can be used for:
a) Organization of items (e.g. in a market – shirts & ties, trousers, belts)
b) Prediction of sales and inventory (shopping behaviour)
c) Advertising or promotion
d) Recommendation systems (Amazon's "customers who bought x also bought y…")

For monitoring datasets (news, movies, CDs), the rules can be used to predict faults or load on the system, etc.

Formulation of the Problem

Given a dataset of transactions

D = {t1, t2, …, tn}

where each

ti = {Ii1, Ii2, …, Iik}

is a set of items drawn from the possible items

I = {I1, …, Im}, such that k ≤ m

Formulation of the Problem (cont.)

An association rule is an implication of the form

x ⇒ y

where x, y ⊂ I are sets of items, called itemsets, and

x ∩ y = ∅

i.e. no intersection – no item is in both. (Why? Otherwise the implication would be circular.)

An itemset can be viewed as a conjunction of items.
Definition of Support of an Item or Itemset

The support of an itemset x is the percentage of transactions in which that item (or itemset) occurs:

support(x) = (# of transactions containing x) / n

where n is the number of transactions.
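The support definition can be computed directly; a minimal Python sketch on the transaction data set from the earlier slide (the `support` helper name is mine, not from the lecture):

```python
# Transaction data set from the example slide.
transactions = [
    {"shirt", "trouser", "tie"},   # t1
    {"shirt", "tie"},              # t2
    {"trouser", "socks", "tie"},   # t3
    {"trouser", "belt"},           # t4
    {"shirt", "trouser"},          # t5
    {"shirt", "socks"},            # t6
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    count = sum(1 for t in transactions if itemset <= t)
    return count / len(transactions)

print(support({"shirt"}, transactions))         # 4/6
print(support({"shirt", "tie"}, transactions))  # 2/6
```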

Example: Market Basket Transactions (continued)

Using the same transaction data set, we can compute the support of each itemset:

Itemset                  Support   %
shirt                    4/6       66
trouser                  4/6       66
tie                      3/6       50
socks                    2/6       33
belt                     1/6       16
shirt, trouser           2/6       33
shirt, tie               2/6       33
shirt, socks             1/6       16
shirt, belt              0/6       0
trouser, tie             2/6       33
trouser, socks           1/6       16
trouser, belt            1/6       16
tie, socks               1/6       16
tie, belt                0/6       0
socks, belt              0/6       0
shirt, trouser, tie      1/6       16
trouser, socks, tie      1/6       16
(all other 3-itemsets)   0/6       0
(all 4-itemsets)         0/6       0
(the 5-itemset)          0/6       0

Support of an Association Rule

For an association rule x ⇒ y, the support is the percentage of transactions in D that contain x ∪ y:

s = support(x ⇒ y) = (# of transactions containing x ∪ y) / n

i.e. the frequency of the rule in the dataset.

Confidence of an Association Rule

The confidence (or strength) α of an association rule x ⇒ y is the ratio of
• the number of transactions that contain x ∪ y
• to the number of transactions that contain x:

α = confidence(x ⇒ y) = (# of transactions containing x ∪ y) / (# of transactions containing x)
                      = s(x ⇒ y) / s(x)

Example

x ⇒ y                   s      α
shirt ⇒ tie             33%    50%
tie ⇒ shirt             33%    66%
shirt, tie ⇒ trouser    16%    50%
tie ⇒ belt              0%     0%
:                       :      :

Notice the non-symmetry of α: the denominator is different for each direction.
(Note: items can be categorical or nominal; the approach can also be generalized to interval data.)
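The confidence formula α = s(x ⇒ y)/s(x), and the asymmetry shown in the table, can be reproduced with a short sketch (support helper re-defined here so the snippet is self-contained; names are mine):

```python
transactions = [
    {"shirt", "trouser", "tie"}, {"shirt", "tie"},
    {"trouser", "socks", "tie"}, {"trouser", "belt"},
    {"shirt", "trouser"}, {"shirt", "socks"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def confidence(x, y, transactions):
    """alpha(x => y) = support(x ∪ y) / support(x); note the asymmetry."""
    return support(x | y, transactions) / support(x, transactions)

print(confidence({"shirt"}, {"tie"}, transactions))  # 2/4 = 50%
print(confidence({"tie"}, {"shirt"}, transactions))  # 2/3 ≈ 66%
```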

Association Rule Mining Problem

The Association Rule Mining Problem is to find the association rules whose support and confidence are above specified support and confidence thresholds (s′, α′).

Brute Force Method
List all possible rules, calculate their support and confidence, and select those that exceed the specified thresholds.

Note: this is often not practical for large data sets; m items give 2^m − 1 possible itemsets, each of which can generate multiple rules.
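The brute-force method can be sketched as: enumerate every itemset, then every split into antecedent and consequent, and keep the rules above both thresholds. Infeasible at scale, but fine for the 5-item example (thresholds s′ = 17%, α′ = 51% are taken from the worked example later in the lecture):

```python
from itertools import combinations

transactions = [
    {"shirt", "trouser", "tie"}, {"shirt", "tie"},
    {"trouser", "socks", "tie"}, {"trouser", "belt"},
    {"shirt", "trouser"}, {"shirt", "socks"},
]
items = sorted(set().union(*transactions))
s_min, a_min = 0.17, 0.51

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

rules = []
for r in range(2, len(items) + 1):
    for c in combinations(items, r):
        l = frozenset(c)
        if support(l) < s_min:
            continue  # rule support = support(l), so skip small itemsets
        for k in range(1, len(l)):          # every non-empty proper subset x
            for x in map(frozenset, combinations(sorted(l), k)):
                if support(l) / support(x) >= a_min:   # confidence test
                    rules.append((set(x), set(l - x)))

print(rules)  # only tie => shirt and tie => trouser survive
```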

Observations

Observation 1: "Large support rules" are composed of large support itemsets:

support(x ⇒ y) = support(x ∪ y)

We need an efficient way to generate large support itemsets (call them large itemsets; "large" means support ≥ s′).

Observation 2: any subset of a large itemset is also large:

support(x ∪ y) ≤ support(x)
support(x ∪ y) ≤ support(y)

since x (or y) is counted in every occurrence of x ∪ y, plus its own occurrences.
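Observation 2's inequality can be checked numerically on the example data; a quick sketch over all item pairs:

```python
from itertools import combinations

transactions = [
    {"shirt", "trouser", "tie"}, {"shirt", "tie"},
    {"trouser", "socks", "tie"}, {"trouser", "belt"},
    {"shirt", "trouser"}, {"shirt", "socks"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

items = sorted(set().union(*transactions))
pairs = list(combinations(items, 2))

# support(x ∪ y) can never exceed the support of either part.
for a, b in pairs:
    x, y = frozenset([a]), frozenset([b])
    assert support(x | y) <= min(support(x), support(y))

print("holds for all", len(pairs), "pairs")
```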

Support Thresholds

If support(x ∪ y) > s′
then support(x) > s′
and support(y) > s′

Also, a superset of an itemset that is not large is not large:

if support(x) < s′ then support(x ∪ y) < s′   (x, y can be subsets)

The Apriori Algorithm

(Agrawal & Srikant, 1994; the most common algorithm, used in commercial packages)

Notation
D        dataset of transactions
I        set of possible items
Li       large itemsets of size i (consisting of i items)
L        the set of all large itemsets
Ci       candidate itemsets of size i
s′, α′   support & confidence thresholds

Start with single items and find a set L1 of large itemsets (support > s′) of size 1. Apply joins between the items of L1 to form candidate sets of size 2, C2, and so on.

Example

Suppose s′ = 17% and α′ = 51%.

C1 = {shirt}, {trouser}, {tie}, {socks}, {belt}
L1 = {shirt}, {trouser}, {tie}, {socks}        ({belt} is dropped: s < 17%)

(C1 is the set of all available items at the start.)

Joining and Pruning

C2 = {shirt, trouser}, {shirt, tie}, {shirt, socks}, {trouser, tie}, {trouser, socks}, {tie, socks}
L2 = {shirt, trouser}, {shirt, tie}, {trouser, tie}
C3 = {shirt, trouser, tie}
L3 = φ

Suppose {shirt, tie} were not large. Should C3 then be φ? Yes – one of its subsets would not be large, so the candidate is pruned.

Apriori Algorithm Example

Rules: for each ℓ ∈ L
         for each x ⊂ ℓ, x ≠ φ
           if support(ℓ) / support(x) ≥ α′ then accept x ⇒ (ℓ − x)

s(shirt, trouser) / s(shirt)    = 50%        reject shirt ⇒ trouser
s(shirt, trouser) / s(trouser)  = 50%        reject trouser ⇒ shirt
s(tie, shirt) / s(tie)          = 2/3 = 66%  accept tie ⇒ shirt
s(tie, trouser) / s(tie)        = 2/3 = 66%  accept tie ⇒ trouser
s(shirt, tie) / s(shirt)        = 50%        reject shirt ⇒ tie
s(trouser, tie) / s(trouser)    = 50%        reject trouser ⇒ tie

Apriori Algorithm

L = L1 = set of itemsets of size 1 that meet the s′ threshold
K = 1
Repeat
    K = K + 1
    CK = Apriori-Gen(L(K−1), s′)      // generate candidates: join step and prune step
    Initialize count for each candidate in CK to 0
    for each transaction in D
        increment count for any matching candidate in CK
    endfor
    LK = itemsets from CK with support ≥ s′
    L = L ∪ LK
Until (LK is empty) or (K = max # of items)
Return L

The algorithm iteratively builds the list of large itemsets via multiple passes over the database.

Apriori-Gen(L(i−1), s′)
    Ci = φ
    for each I ∈ L(i−1) do
        for each J ≠ I ∈ L(i−1) do
            if i−2 of the elements in I and J are equal then
                Ci = Ci ∪ {I ∪ J}
    Remove any candidates with support < s′
    Remove any candidates for which any subset is not large (< s′)
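The pseudocode above can be turned into a compact, runnable sketch; this is a minimal version (my own naming), assuming the example transactions and the s′ = 17% threshold from the slides:

```python
from itertools import combinations

transactions = [
    {"shirt", "trouser", "tie"}, {"shirt", "tie"},
    {"trouser", "socks", "tie"}, {"trouser", "belt"},
    {"shirt", "trouser"}, {"shirt", "socks"},
]
s_min = 0.17  # the s' threshold from the example

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def apriori_gen(prev_large, k):
    """Join step: union (k-1)-itemsets that overlap in k-2 items;
    prune step: drop candidates with any non-large (k-1)-subset."""
    candidates = set()
    for a in prev_large:
        for b in prev_large:
            union = a | b
            if len(union) == k and all(
                    frozenset(s) in prev_large
                    for s in combinations(union, k - 1)):
                candidates.add(union)
    return candidates

# L1: large itemsets of size 1
items = set().union(*transactions)
large = [{frozenset([i]) for i in items if support(frozenset([i])) >= s_min}]
k = 1
while large[-1]:
    k += 1
    ck = apriori_gen(large[-1], k)
    large.append({c for c in ck if support(c) >= s_min})

all_large = set().union(*large)
print(sorted(sorted(s) for s in all_large))
```

On the example data this yields the 4 large singletons plus the 3 large pairs; the only 3-candidate, {shirt, trouser, tie}, falls below s′ and is discarded.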
Complexity: Apriori requires O(m) scans of the database, where m is the number of items. Each scan goes through the n transactions (i.e. O(nm) overall).

Issues & Further Development

A. How to Deal with Large Databases

The Apriori algorithm requires a large number of scans. Some approaches to deal with this include:
• Sampling the database (reduce the support threshold to get the large itemsets in the sample, then scan the full database to confirm the large itemsets)
• Partitioning the database (large itemsets must be large in at least one of the partitions)
• Parallel & distributed algorithms (task parallel and data parallel)
• Hashing functions & tree structures

B. Dealing with Quantitative Data

Rather than dealing with categorical values only, some attributes will have numeric values. The range of these values may be partitioned into intervals, and the intervals are used as values in the itemsets, which may increase the number of possible items.
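The partitioning idea relies on the property that a globally large itemset must be large in at least one partition: mine each partition locally, union the local results as candidates, then one full scan keeps the globally large ones. A minimal two-partition sketch (helper names are mine; the local miner is brute force, fine for tiny partitions):

```python
from itertools import combinations

transactions = [
    {"shirt", "trouser", "tie"}, {"shirt", "tie"},
    {"trouser", "socks", "tie"}, {"trouser", "belt"},
    {"shirt", "trouser"}, {"shirt", "socks"},
]
s_min = 0.17

def support(itemset, ts):
    return sum(1 for t in ts if itemset <= t) / len(ts)

def large_itemsets(ts, s_min):
    """Brute-force local miner."""
    items = sorted(set().union(*ts))
    return {frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)
            if support(frozenset(c), ts) >= s_min}

# Pass 1: mine each partition locally; the union of the local results
# is guaranteed to contain every globally large itemset.
parts = [transactions[:3], transactions[3:]]
candidates = set().union(*(large_itemsets(p, s_min) for p in parts))

# Pass 2: one full scan keeps only the globally large candidates.
globally_large = {c for c in candidates if support(c, transactions) >= s_min}

assert globally_large == large_itemsets(transactions, s_min)
print(len(globally_large), "large itemsets found in two passes")
```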
e.g. a rule might be: a customer who buys a shirt between $40 and $60 will also buy a tie between $20 and $30. Support then needs to check for a matching interval or a superset of it.

Measuring the Quality of Rules

Other measures for association use probabilities:

Support:     s(A ⇒ B) = P(A, B)       – frequency of A, B together
Confidence:  α(A ⇒ B) = P(B | A)      – frequency(A, B) / frequency(A)

Problem with confidence:
• It ignores P(B): the obviousness of a rule has no impact; the prior probability of the association is ignored.

Interest/Lift/Surprise and Leverage

Lift
lift(A ⇒ B) = support(A ∪ B) / (support(A) ⋅ support(B))
• measures how many more times A and B occur together than expected if they were independent.
• Problem: symmetric, so the direction of the rule isn’t clear.

Leverage
leverage(A ⇒ B) = support(A ∪ B) − (support(A) ⋅ support(B))
• measures the difference between how often A and B appear together and what would be expected if they were independent.

Conviction

A ⇒ B ≡ ¬A ∪ B = ¬(A ∩ ¬B)

conviction(A ⇒ B) = support(A) ⋅ support(¬B) / support(A ∩ ¬B)
• inverse of interest
• measures the frequency of A together with the absence of B
• Value = 1 if A and B are not related; value = infinity if they are always related.

Web Mining

The web has a lot of data that is targeted for data mining tasks.
http://www.worldwidewebsize.com/
• Web page content: text, images, videos, audio, blogs
• Web usage: log records of use, log records of servers
• Web structure: links, ranking, tags

The data is dynamic, mainly semi-structured, linked, noisy and redundant. (Margaret Dunham, Data Mining Introductory and Advanced Topics, Prentice Hall, 2003.)
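The rule-quality measures defined above (lift, leverage, conviction) can be implemented directly from their support-based definitions; a minimal sketch on the example basket data (helper names are mine):

```python
transactions = [
    {"shirt", "trouser", "tie"}, {"shirt", "tie"},
    {"trouser", "socks", "tie"}, {"trouser", "belt"},
    {"shirt", "trouser"}, {"shirt", "socks"},
]

def support(itemset):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(a, b):
    return support(a | b) / (support(a) * support(b))

def leverage(a, b):
    return support(a | b) - support(a) * support(b)

def conviction(a, b):
    """support(A)·support(¬B) / support(A ∩ ¬B); infinite if A always implies B."""
    a_not_b = sum(1 for t in transactions
                  if a <= t and not b <= t) / len(transactions)
    if a_not_b == 0:
        return float("inf")
    return support(a) * (1 - support(b)) / a_not_b

A, B = {"trouser"}, {"belt"}
print(lift(A, B))        # ≈ 1.5: co-occur more than independence predicts
print(leverage(A, B))    # ≈ 0.056
print(conviction(A, B))  # ≈ 1.11
```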
Web Content Mining

Data: web pages, search engine results
Purpose:
• Improve search engine results: returning more relevant content
• Personalization / recommendation
Methods and Tools: text mining, multimedia mining, sentiment and opinion mining, classification, clustering, association rules

Web Usage Mining

Data: access log files on servers, user profiles, cookies
Purpose:
• Improve server performance: speeding response by prediction
• E-commerce: identify target clients and potential preferences
• Page ranking
Methods and Tools: data structures, information retrieval, clustering of users, association rules