
CS 361A (Advanced Algorithms)

CS 361A
(Advanced Data Structures and Algorithms)
Lecture 20 (Dec 7, 2005)
Data Mining: Association Rules

Rajeev Motwani
(partially based on notes by Jeff Ullman)

Association Rules Overview
Market Baskets & Association Rules
Frequent item-sets
A-priori algorithm
Hash-based improvements
One- or two-pass approximations
High-correlation mining

Association Rules
Two Traditions
DM is science of approximating joint distributions
Representation of process generating data
Predict P[E] for interesting events E
DM is technology for fast counting
Can compute certain summaries quickly
Let's try to use them
Association Rules
Captures interesting pieces of joint distribution
Exploits fast counting technology

Market-Basket Model
Large Sets
Items A = {A1, A2, …, Am}
e.g., products sold in supermarket
Baskets B = {B1, B2, …, Bn}
small subsets of items in A
e.g., items bought by customer in one transaction
Support – sup(X) = number of baskets with itemset X
Frequent Itemset Problem
Given – support threshold s
Frequent Itemsets – itemsets X with sup(X) ≥ s
Find – all frequent itemsets

Example
Items A = {milk, coke, pepsi, beer, juice}.
Baskets

B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
Support threshold s=3
Frequent itemsets

{m}, {c}, {b}, {j}, {m, b}, {c, b}, {j, c}
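
A brute-force Python sketch (illustrative only, not from the slides) that recovers these frequent itemsets by direct support counting; exhaustive enumeration like this is feasible only for a toy example of this size:

from itertools import combinations

baskets = [{'m','c','b'}, {'m','p','j'}, {'m','b'}, {'c','j'},
           {'m','p','b'}, {'m','c','b','j'}, {'c','b','j'}, {'b','c'}]
s = 3  # support threshold

items = sorted(set().union(*baskets))
frequent = []
for k in range(1, len(items) + 1):
    for X in combinations(items, k):
        sup = sum(1 for B in baskets if set(X) <= B)  # sup(X) = # baskets containing X
        if sup >= s:
            frequent.append((set(X), sup))

print(frequent)  # {m}, {c}, {b}, {j}, {m,b}, {c,b}, {c,j} with their supports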

Application 1 (Retail Stores)
Real market baskets
chain stores keep TBs of customer purchase info
Value?
how typical customers navigate stores
positioning tempting items
suggests “tie-in tricks” – e.g., hamburger sale while raising ketchup price

High support needed, or no $$’s

Application 2 (Information Retrieval)
Scenario 1
baskets = documents
items = words in documents
frequent word-groups = linked concepts.
Scenario 2
items = sentences
baskets = documents containing sentences
frequent sentence-groups = possible plagiarism

Application 3 (Web Search)
Scenario 1
baskets = web pages
items = outgoing links
pages with similar references → about same topic
Scenario 2
baskets = web pages
items = incoming links
pages with similar in-links → mirrors, or same topic

Scale of Problem
WalMart
sells m=100,000 items
tracks n=1,000,000,000 baskets
Web
several billion pages
one new “word” per page
Assumptions
m small enough for small amount of memory per item
m too large for memory per pair or k-set of items
n too large for memory per basket
Very sparse data – rare for item to be in basket

Association Rules
If-then rules about basket contents
{A1, A2, …, Ak} → Aj
if basket has X={A1,…,Ak}, then likely to have Aj
Confidence – probability of Aj given A1,…,Ak
conf(X → Aj) = sup(X + Aj) / sup(X)

Support (of rule)
sup(X → Aj) = sup(X + Aj)
Example
B1 = {m, c, b} B2 = {m, p, j}
B3 = {m, b} B4 = {c, j}
B5 = {m, p, b} B6 = {m, c, b, j}
B7 = {c, b, j} B8 = {b, c}
Association Rule
{m, b} → c
Support = 2
Confidence = 2/4 = 50%

Finding Association Rules
Goal – find all association rules such that
support ≥ s
confidence ≥ c
Reduction to Frequent Itemset Problem
Find all frequent itemsets X
Given X={A1, …, Ak}, generate all rules X−Aj → Aj
Confidence = sup(X)/sup(X−Aj)
Support = sup(X)
Observe X−Aj also frequent ⇒ support already known
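
A minimal Python sketch of this reduction, assuming a dict sup that maps each frequent itemset (as a frozenset) to its support; the function and variable names are illustrative:

def generate_rules(sup, s, c):
    """sup: dict frozenset(itemset) -> support, containing all frequent itemsets.
    Emits rules X - Aj -> Aj with support >= s and confidence >= c."""
    rules = []
    for X, sup_X in sup.items():
        if len(X) < 2 or sup_X < s:
            continue
        for Aj in X:
            left = X - {Aj}               # X - Aj, also frequent by monotonicity
            conf = sup_X / sup[left]      # confidence = sup(X) / sup(X - Aj)
            if conf >= c:
                rules.append((set(left), Aj, sup_X, conf))
    return rules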

Computation Model
Data Storage
Flat Files, rather than database system
Stored on disk, basket-by-basket
Cost Measure – number of passes
Count disk I/O only
Given data size, avoid random seeks and do linear-scans
Main-Memory Bottleneck
Algorithms maintain count-tables in memory
Limitation on number of counters
Swapping count-tables to disk is a disaster

Finding Frequent Pairs
Frequent 2-Sets
hard case already
focus for now, later extend to k-sets
Naïve Algorithm
Counters – all m(m–1)/2 item pairs
Single pass – scanning all baskets
Basket of size b – increments b(b–1)/2 counters
Failure?
if memory < m(m–1)/2, even for m=100,000

Monotonicity Property
Underlies all known algorithms
Monotonicity Property
Given itemsets X ⊆ Y
Then sup(X) ≥ sup(Y)
Contrapositive (for 2-sets)
sup({Ai}) < s ⇒ sup({Ai, Aj}) < s

A-Priori Algorithm
A-Priori – 2-pass approach in limited memory
Pass 1
m counters (candidate items in A)
Linear scan of baskets b
Increment counters for each item in b
Mark as frequent, f items of count at least s
Pass 2
f(f–1)/2 counters (candidate pairs of frequent items)
Linear scan of baskets b
Increment counters for each pair of frequent items in b
Failure – if memory < m + f(f–1)/2

Memory Usage – A-Priori
[Figure: memory layout – Pass 1 holds counters for candidate items; Pass 2 holds frequent items and counters for candidate pairs]

PCY Idea
Improvement upon A-Priori
Observe – during Pass 1, memory mostly idle
Idea
Use idle memory for hash-table H
Pass 1 – hash pairs from b into H
Increment counter at hash location
At end – bitmap of high-frequency hash locations
Pass 2 – bitmap gives extra condition for candidate pairs

Memory Usage – PCY
[Figure: memory layout – Pass 1 holds candidate-item counters and the hash table; Pass 2 holds frequent items, the bitmap, and candidate-pair counters]

PCY Algorithm
Pass 1
m counters and hash-table T
Linear scan of baskets b
Increment counters for each item in b
Increment hash-table counter for each item-pair in b
Mark as frequent, f items of count at least s
Summarize T as bitmap (count ≥ s ⇒ bit = 1)
Pass 2
Counters only for the F qualified pairs (Xi, Xj):
both are frequent
pair hashes to frequent bucket (bit=1)
Linear scan of baskets b
Increment counters for candidate qualified pairs of items in b
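
A compact Python sketch of the two PCY passes described above, over a list of baskets (each a set of items); the bucket count NUM_BUCKETS and the pair hash are illustrative choices, not prescribed by the slides:

from itertools import combinations
from collections import defaultdict

NUM_BUCKETS = 1 << 20   # illustrative hash-table size

def pcy(baskets, s):
    # Pass 1: count items; hash every pair in every basket into a bucket counter
    item_count = defaultdict(int)
    bucket = [0] * NUM_BUCKETS
    for b in baskets:
        for i in b:
            item_count[i] += 1
        for x, y in combinations(sorted(b), 2):
            bucket[hash((x, y)) % NUM_BUCKETS] += 1
    freq_items = {i for i, n in item_count.items() if n >= s}
    bitmap = [n >= s for n in bucket]        # bitmap of frequent buckets

    # Pass 2: count only qualified pairs (both items frequent, bucket bit set)
    pair_count = defaultdict(int)
    for b in baskets:
        cand = sorted(i for i in b if i in freq_items)
        for x, y in combinations(cand, 2):
            if bitmap[hash((x, y)) % NUM_BUCKETS]:
                pair_count[(x, y)] += 1
    return {p: n for p, n in pair_count.items() if n >= s}

Dropping the bucket counters and the bitmap test leaves the plain two-pass A-Priori skeleton described above.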

Multistage PCY Algorithm
Problem – False positives from hashing
New Idea
Multiple rounds of hashing
After Pass 1, get list of qualified pairs
In Pass 2, hash only qualified pairs
Fewer pairs hash to each bucket → fewer false positives
(buckets with count ≥ s, yet containing no pair of count ≥ s)
In Pass 3, infrequent pairs are less likely to qualify
Repetition – reduce memory, but more passes
Failure – memory < O(f + F)

Memory Usage – Multistage PCY
[Figure: memory layout across passes – candidate-item counters + Hash Table 1; frequent items + Bitmap 1 + Hash Table 2; frequent items + Bitmap 1 + Bitmap 2 + candidate-pair counters]

Finding Larger Itemsets
Goal – extend to frequent k-sets, k > 2
Monotonicity

itemset X is frequent only if X – {Xj} is frequent for all Xj
Idea
Stage k – finds all frequent k-sets
Stage 1 – gets all frequent items
Stage k – maintain counters for all candidate k-sets
Candidates – k-sets whose (k–1)-subsets are all frequent
Total cost: number of passes = max size of frequent itemset
Observe – Enhancements such as PCY all apply
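
A minimal sketch of one candidate-generation stage, assuming the frequent (k–1)-sets are given as a set of frozensets (function and argument names are illustrative):

from itertools import combinations

def candidate_ksets(freq_prev, k):
    """Candidate k-sets: unions of two frequent (k-1)-sets, all of whose
    (k-1)-subsets are themselves frequent (k-1)-sets."""
    prev = list(freq_prev)
    cands = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            union = prev[i] | prev[j]
            if len(union) == k and all(
                    frozenset(sub) in freq_prev
                    for sub in combinations(union, k - 1)):
                cands.add(frozenset(union))
    return cands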

Approximation Techniques
Goal
find all frequent k-sets
reduce to 2 passes
must lose something → accuracy

Approaches
Sampling algorithm
SON (Savasere, Omiecinski, Navathe) Algorithm
Toivonen Algorithm

Sampling Algorithm
Pass 1 – load random sample of baskets in memory
Run A-Priori (or enhancement)
Scale-down support threshold (e.g., if 1% sample, use s/100 as support threshold)
Compute all frequent k-sets in memory from sample
Need to leave enough space for counters
Pass 2
Keep counters only for frequent k-sets of random sample
Get exact counts for candidates to validate
Error?
No false positives (Pass 2)
False negatives (X frequent, but not in sample)
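
A hedged sketch of the two passes; the in-memory frequent_itemsets helper below is a brute-force stand-in for A-Priori, and sample_frac is an illustrative parameter:

import random
from itertools import combinations

def frequent_itemsets(baskets, s, max_k=4):
    """Brute-force in-memory frequent-itemset finder (stand-in for A-Priori)."""
    items = sorted(set().union(*baskets)) if baskets else []
    freq = {}
    for k in range(1, max_k + 1):
        for X in combinations(items, k):
            sup = sum(1 for B in baskets if set(X) <= B)
            if sup >= s:
                freq[frozenset(X)] = sup
    return freq

def sampling_algorithm(baskets, s, sample_frac=0.01):
    # Pass 1: frequent itemsets of a random sample, at a scaled-down threshold
    sample = [B for B in baskets if random.random() < sample_frac]
    scaled_s = max(1, s * len(sample) // len(baskets))
    candidates = frequent_itemsets(sample, scaled_s)
    # Pass 2: exact counts for the sample's frequent itemsets only
    # (no false positives; overall-frequent itemsets missed by the sample are false negatives)
    counts = {X: 0 for X in candidates}
    for B in baskets:
        for X in counts:
            if X <= B:
                counts[X] += 1
    return {X: n for X, n in counts.items() if n >= s}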

SON Algorithm
Pass 1 – Batch Processing
Scan data on disk
Repeatedly fill memory with new batch of data
Run sampling algorithm on each batch
Generate candidate frequent itemsets
Candidate Itemsets – if frequent in some batch
Pass 2 – Validate candidate itemsets
Monotonicity Property

Itemset X is frequent overall ⇒ frequent in at least one batch
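
A sketch of SON's batch processing, reusing the frequent_itemsets stand-in from the sampling sketch above; batch_size is an illustrative parameter:

def son(baskets, s, batch_size=100000):
    """SON: union of per-batch frequent itemsets as candidates, then validate."""
    n = len(baskets)
    candidates = set()
    # Pass 1: itemsets frequent in some batch, at a threshold scaled to the batch
    for start in range(0, n, batch_size):
        batch = baskets[start:start + batch_size]
        batch_s = max(1, s * len(batch) // n)
        candidates |= set(frequent_itemsets(batch, batch_s))
    # Pass 2: exact counts over all baskets validate the candidates
    counts = {X: 0 for X in candidates}
    for B in baskets:
        for X in counts:
            if X <= B:
                counts[X] += 1
    return {X: c for X, c in counts.items() if c >= s}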

Toivonen’s Algorithm
Lower Threshold in Sampling Algorithm
Example – if sampling 1%, use 0.008s as support threshold
Goal – overkill to avoid any false negatives
Negative Border
Itemset X infrequent in sample, but all its immediate subsets are frequent in the sample
Example: AB, BC, AC frequent, but ABC infrequent
Pass 2
Count candidates and negative border
Negative border itemsets all infrequent ⇒ candidates are exactly the frequent itemsets
Otherwise? – start over!
Achievement? – reduced failure probability, while keeping candidate-count low enough for memory
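
A sketch of computing the negative border from the sample's frequent itemsets (function and argument names are illustrative); Pass 2 then counts candidates and border itemsets exactly and restarts if any border itemset turns out frequent:

from itertools import combinations

def negative_border(freq_sample, items):
    """Itemsets not frequent in the sample whose immediate subsets all are.
    freq_sample: set of frozensets frequent in the sample; items: set of all items."""
    border = set()
    # singletons not frequent in the sample (their only subset, the empty set, is trivially frequent)
    border |= {frozenset([i]) for i in items} - freq_sample
    # larger sets: extend each frequent set by one item and test all immediate subsets
    for X in freq_sample:
        for i in items - set(X):
            Y = X | {i}
            if Y not in freq_sample and all(
                    frozenset(sub) in freq_sample
                    for sub in combinations(Y, len(Y) - 1)):
                border.add(frozenset(Y))
    return border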

Low-Support, High-Correlation
Goal – Find highly correlated pairs, even if rare
Marketing requires high support, for dollar value
But mining the generating process is often based on high correlation, rather than high support
Example: Few customers buy Ketel Vodka, but of those who do, 90% buy Beluga Caviar
Applications – plagiarism, collaborative filtering, clustering
Observe
Enumerate rules of high confidence
Ignore support completely
A-Priori technique inapplicable

Matrix Representation
Sparse, Boolean Matrix M
Column c = Item Xc; Row r = Basket Br
M(r,c) = 1 iff item c in basket r
Example

m c p b j
B1={m,c,b} 1 1 0 1 0
B2={m,p,b} 1 0 1 1 0
B3={m,b} 1 0 0 1 0
B4={c,j} 0 1 0 0 1
B5={m,p,j} 1 0 1 0 1
B6={m,c,b,j} 1 1 0 1 1
B7={c,b,j} 0 1 0 1 1
B8={c,b} 0 1 0 1 0

Column Similarity
View column as row-set (where it has 1’s)
Column Similarity (Jaccard measure)
sim(Ci, Cj) = |Ci ∩ Cj| / |Ci ∪ Cj|
Example

Finding correlated columns → finding similar columns

Ci Cj

0 1
1 0
1 1 sim(Ci,Cj) = 2/5 = 0.4
0 0
1 1
0 1
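
Viewing each column as its set of row indices with a 1, column similarity is one line of Python (the sets below encode the example above and are an illustrative choice of representation):

def sim(Ci, Cj):
    """Jaccard similarity of two columns given as sets of row indices with a 1."""
    return len(Ci & Cj) / len(Ci | Cj)

# example above: Ci has 1's in rows {2, 3, 5}, Cj in rows {1, 3, 5, 6}
print(sim({2, 3, 5}, {1, 3, 5, 6}))   # 2/5 = 0.4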

Identifying Similar Columns?
Question – finding candidate pairs in small memory
Signature Idea
Hash columns Ci to small signature sig(Ci)
Set of sig(Ci) fits in memory
sim(Ci,Cj) approximated by sim(sig(Ci),sig(Cj))
Naïve Approach
Sample P rows uniformly at random
Define sig(Ci) as P bits of Ci in sample
Problem
sparsity → would miss interesting part of columns
sample would get only 0’s in columns

Key Observation
For columns Ci, Cj, four types of rows

Ci Cj
A 1 1
B 1 0
C 0 1
D 0 0
Overload notation: A = # of rows of type A
Claim
sim(Ci, Cj) = A / (A + B + C)
Min Hashing
Randomly permute rows
Hash h(Ci) = index of first row with 1 in column Ci

Surprising Property
P[h(Ci) = h(Cj)] = sim(Ci, Cj)
Why?
Both are A/(A+B+C)
Look down columns Ci, Cj until first non-Type-D row
h(Ci) = h(Cj) ⇔ that first row is a type-A row

Min-Hash Signatures
Pick – P random row permutations
MinHash Signature

sig(C) = list of P indexes of first rows with 1 in column C

Similarity of signatures
Fact: sim(sig(Ci),sig(Cj)) = fraction of permutations where MinHash values agree
Observe E[sim(sig(Ci),sig(Cj))] = sim(Ci,Cj)

Example
C1 C2 C3
R1 1 0 1
R2 0 1 1
R3 1 0 0
R4 1 0 1
R5 0 1 0
Signatures
S1 S2 S3
Perm 1 = (12345) 1 2 1
Perm 2 = (54321) 4 5 4
Perm 3 = (34512) 3 5 4
Similarities
1-2 1-3 2-3
Col-Col 0.00 0.50 0.25
Sig-Sig 0.00 0.67 0.00
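
A short Python sketch that computes MinHash signatures from the three explicit permutations and reproduces the signature and similarity values above (encoding columns as sets of row indices is an illustrative choice):

# columns given as sets of rows (1-indexed) containing a 1
C1, C2, C3 = {1, 3, 4}, {2, 5}, {1, 2, 4}
perms = [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)]  # row orders

def minhash_sig(col, perms):
    # for each permutation, record the first row (in permuted order) holding a 1
    return [next(r for r in order if r in col) for order in perms]

sigs = [minhash_sig(C, perms) for C in (C1, C2, C3)]
print(sigs)  # [[1, 4, 3], [2, 5, 5], [1, 4, 4]]

def sig_sim(s1, s2):
    return sum(a == b for a, b in zip(s1, s2)) / len(s1)

print(sig_sim(sigs[0], sigs[2]))  # 2/3 = 0.67, estimating sim(C1, C3) = 0.5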

Implementation Trick
Permuting rows even once is prohibitive
Row Hashing
Pick P hash functions hk: {1,…,n} → {1,…,O(n²)} [Fingerprint]
Ordering under hk gives random row permutation
One-pass Implementation
For each Ci and hk, keep “slot” for min-hash value
Initialize all slot(Ci,hk) to infinity
Scan rows in arbitrary order looking for 1’s
Suppose row Rj has 1 in column Ci
For each hk,
if hk(j) < slot(Ci,hk), then slot(Ci,hk) ← hk(j)

Example
     C1  C2
R1    1   0
R2    0   1
R3    1   1
R4    1   0
R5    0   1

h(x) = x mod 5
g(x) = 2x+1 mod 5

             C1 slots   C2 slots
h(1) = 1        1          -
g(1) = 3        3          -
h(2) = 2        1          2
g(2) = 0        3          0
h(3) = 3        1          2
g(3) = 2        2          0
h(4) = 4        1          2
g(4) = 4        2          0
h(5) = 0        1          0
g(5) = 1        2          0

Comparing Signatures
Signature Matrix S
Rows = Hash Functions
Columns = Columns
Entries = Signatures
Compute – pair-wise similarity of signature columns
Problem
MinHash fits column signatures in memory
But comparing all signature-pairs takes too much time
Technique to limit candidate pairs?
A-Priori does not work
Locality-Sensitive Hashing (LSH)

Locality-Sensitive Hashing
Partition signature matrix S into b bands of r rows (br = P)
Band Hash Hq: {r-columns} → {1,…,k}
Candidate pairs – columns that hash to the same bucket in at least one band
Tune – catch most similar pairs, few nonsimilar pairs
[Figure: signature matrix split into bands, each band hashed by its own band hash Hq]

Example
Suppose m = 100,000 columns
Signature Matrix
Signatures from P = 100 hashes
Space – total 40MB
Number of column pairs – total 5,000,000,000
Band-Hash Tables
Choose b = 20 bands of r = 5 rows each
Space – total 8MB

Band-Hash Analysis
Suppose sim(Ci,Cj) = 0.8
P[Ci,Cj identical in one band] = (0.8)^5 = 0.33
P[Ci,Cj distinct in all bands] = (1 – 0.33)^20 = 0.00035
Miss about 1/3000 of 80%-similar column pairs
Suppose sim(Ci,Cj) = 0.4
P[Ci,Cj identical in one band] = (0.4)^5 = 0.01
P[Ci,Cj identical in > 0 bands] < 0.01 × 20 = 0.2
Low probability that nonidentical columns collide in a band
False positives much lower for similarities << 40%
Overall – band-hash collisions measure similarity
Formal analysis – later in near-neighbor lectures

LSH Summary
Pass 1 – compute signature matrix
Band-Hash – to generate candidate pairs
Pass 2 – check similarity of candidate pairs
LSH Tuning – find almost all pairs with similar signatures, but eliminate most pairs with dissimilar signatures

Densifying – Amplification of 1's
Dense matrices simpler – sample of P rows serves as good signature
Hamming LSH
construct series of matrices
repeatedly halve rows – ORing adjacent row-pairs
thereby, increase density
Each matrix
select candidate pairs between 30–60% 1's
similar in selected rows

Example
Column: 0 0 1 1 0 0 1 0
After one halving (OR adjacent pairs): 0 1 0 1
After two halvings: 1 1
After three halvings: 1

Using Hamming LSH
Constructing matrices
n rows → log2 n matrices
total work = twice that of reading original matrix
Using standard LSH
identify similar columns in each matrix
restrict to columns of medium density

Summary
Finding frequent pairs
A-Priori → PCY (hashing) → Multistage
Finding all frequent itemsets
Sampling → SON → Toivonen
Finding similar pairs
MinHash + LSH, Hamming LSH
Further Work
Scope for improved algorithms
Exploit frequency-counting ideas from earlier lectures
More complex rules (e.g., non-monotonic, negations)
Extend similar pairs to k-sets
Statistical validity issues
References
Mining Associations between Sets of Items in Massive Databases, R. Agrawal, T. Imielinski, and A. Swami. SIGMOD 1993.
Fast Algorithms for Mining Association Rules, R. Agrawal and R. Srikant. VLDB 1994.
An Effective Hash-Based Algorithm for Mining Association Rules, J. S. Park, M.-S. Chen, and P. S. Yu. SIGMOD 1995.
An Efficient Algorithm for Mining Association Rules in Large Databases, A. Savasere, E. Omiecinski, and S. Navathe. The VLDB Journal 1995.
Sampling Large Databases for Association Rules, H. Toivonen. VLDB 1996.
Dynamic Itemset Counting and Implication Rules for Market Basket Data, S. Brin, R. Motwani, S. Tsur, and J.D. Ullman. SIGMOD 1997.
Query Flocks: A Generalization of Association-Rule Mining, D. Tsur, J.D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, S. Nestorov, and A. Rosenthal. SIGMOD 1998.
Finding Interesting Associations without Support Pruning, E. Cohen, M. Datar, S. Fujiwara, A. Gionis, P. Indyk, R. Motwani, J.D. Ullman, and C. Yang. ICDE 2000.
Dynamic Miss-Counting Algorithms: Finding Implication and Similarity Rules with Confidence Pruning, S. Fujiwara, R. Motwani, and J.D. Ullman. ICDE 2000.