Data Mining (EECS 4412)
Sequential Pattern Mining
Parke Godfrey
EECS
Lassonde School of Engineering York University
Thanks to
Professor Aijun An
for creation & use of these slides.
2
Outline
Basic concepts of sequential pattern mining A Simplified Version of GSP Algorithm PrefixSpan
3
An Example Sequence Database
A sequence database consists of a set of sequences
A sequence is an ordered list of itemsets
Itemset: a set of items Items within an itemset are
unordered.
Element: an itemset in a sequence
4
Example of a Sequential Pattern
Database of an online book store Contains data sequences
Each sequence corresponds to a list of transactions by a given customer.
Each transaction contains the books selected by the customer in one order.
A sequential pattern
5% of customers bought “Foundation”, then “Foundation
and Empire” and “Ringworld”, then “Second Foundation”.
Usefulness
Making recommendations. Knowing what to stock.
5
Mining Sequential Patterns
Objective
Finding (interesting) frequent sequences from a sequence database.
Applications
Customer shopping sequence analysis
Web log analysis
DNA or protein analysis
Stock market analysis
Medical domain
Predict outset of a disease from a sequence of symptoms, etc.
Earthquake prediction etc.
6
Basic Concepts
A sequence áa1, a2, …, anñ is contained in (or is a subsequence of) another sequence áb1, b2, …, bmñ if there exist integers i1
Frequent sequences (also called sequential patterns) sequences that satisfy a minimum support (min_sup).
Customer Id
Customer Sequence
1 2 3 4 5
á(30) (90)ñ
á(10 20) (30) (20 60 70)ñ á(30 50 70)ñ
á(30) (40 60 70) (90)ñ á(90)ñ
7
Sequential Pattern Mining
What is sequential pattern mining
Given a sequence database, find the set of frequent sequences that satisfy a minimum support (min_sup).
Algorithms
Initial Apriori-like algorithms (Agrawal and Srikant, 95)
AprioriAll, AprioriSome, and DynamicSome
GSP – an Apriori-like, influential mining method (Srikant and Agrawal, 96)
PrefixSpan (Pei, et al, 01)
SPADE (Zaki, 01)
Mining sequential patterns with constraints (Pei, et al, 02) etc.
8
Apriori Property
The Apriori property in sequential patterns:
Any nonempty subsequence of a frequent sequence must be frequent
If a sequence is infrequent, then none of its super-sequences is frequent.
i.e., if á(3) (4 5)ñ is infrequent, so are á(3) (4 5) (8)ñ and á(3 6) (4 5)ñ
9
GSP
GSP: Generalized Sequential Pattern Mining
Proposed in
R. Srikant and R. Agrawal. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proc. of EDBT’96.
Can be downloaded from the course web site
GSP considers some time constraints and item taxonomy
More general than simply mining sequential patterns
10
A Simplified Version of GSP
GSP without the generalizations
Problem statement:
find all frequent sequences from a database of sequences
given a min_sup
Length of a sequence = number of items in the sequence
Length of <(a)(b)> is 2 Length of <(a b)> is 2 Length of <(a b) (c)> is 3
A length-k sequence is also called k-sequence.
11
General Description of Simplified GSP
Method
Take sequences in form of <(x)> as length-1 candidates
Scan database once, find L1, the set of length-1 sequential patterns
Let k=1; while Lk is not empty do
Form Ck+1, the set of length-(k+1) candidates from Lk;
If Ck+1 is not empty, scan database once, find Lk+1, the set of length-(k+1) sequential patterns
Let k=k+1;
12
Finding Length-1 Sequential Patterns
Initial candidates
<(a)>, <(b)>, <(c)>, <(d)>, <(e)>, <(f)>, <(g)>, <(h)>
Scan database once
count support for candidates
min_sup =40%
min_sup_count =2
Cand
Sup
<(a)>
3
<(b)>
5
<(c)>
4
<(d)>
3
<(e)>
3
<(f)>
2
<(g)>
1
<(h)>
1
Seq-id
Sequence
10
<(bd)(c)(b)(ac)>
20
<(bf)(ce)(b)(fg)>
30
<(ah)(bf)(a)(b)(f)>
40
<(be)(ce)(d)>
50
<(a)(bd)(b)(c)(b)(ade)>
13
Length-2 Candidate Generation
How to generate C2 from L1
Merge every pair of frequent length-1 sequences.
Merging two frequent length-1 sequences <(i1)> and <(i2)> will produce three candidate 2-sequences:
<(i1) (i2)> <(i2) (i1)> and <(i1 i2)> assuming that i1 is different from i2.
Every frequent length-1 sequence <(i)> can join with itself to produce <(i)(i)>.
14
Length-2 Candidates (Cont’d from the example on Slide 13)
Length of a sequence= the number of items
<(a)(a)>
<(a)(b)>
<(a)(c)>
<(a)(d)>
<(a)(e)>
<(a)(f)>
<(b)(a)>
<(b)(b)>
<(b)(c)>
<(b)(d)>
<(b)(e)>
<(b)(f)>
<(c)(a)>
<(c)(b)>
<(c)(c)>
<(c)(d)>
<(c)(e)>
<(c)(f)>
<(d)(a)>
<(d)(b)>
<(d)(c)>
<(d)(d)>
<(d)(e)>
<(d)(f)>
<(e)(a)>
<(e)(b)>
<(e)(c)>
<(e)(d)>
<(e)(e)>
<(e)(f)>
<(f)(a)>
<(f)(b)>
<(f)(c)>
<(f)(d)>
<(f)(e)>
<(f)(f)>
51 length-2 Candidates
<(ab)>
<(ac)>
<(ad)>
<(ae)>
<(af)>
<(bc)>
<(bd)>
<(be)>
<(bf)>
<(cd)>
<(ce)>
<(cf)>
<(de)>
<(df)>
<(ef)>
Without Apriori
property,
8*8+8*7/2=92
candidates
Apriori prunes
44.57% candidates
15
Finding Length-2 Sequential Patterns
Scan database one more time, collect support count for each length-2 candidate
There are 19 length-2 candidates which pass the minimum support threshold in our example DB
They are length-2 sequential patterns:
<(a)(a)>, <(a)(b)> <(b)(a)>,<(b)(b)>,<(b)(c)>,<(b)(d)>,<(b)(e)>,<(b)(f)> <(c)(a)>,<(c)(b)>,<(c)(d)>
<(d)(a)>,<(d)(b)>,<(d)(c)>
<(f)(b)>,<(f)(f)>
<(bd)>,<(bf)>,<(ce)>
16
Generating Ck from Lk-1 (k>2)
Join step: generate Ck by joining Lk-1 with Lk-1 All items in an itemset are ranked in an order.
A sequence s1 joins with s2 if the subsequence obtained by dropping the first item of s1 is the same as the subsequence obtained by dropping the last item of s2.
The joined sequence is s1 plus the last item of s2.
The added item becomes a separate element if it was a separate element in s2, or part of the last element of s1 otherwise.
Examples:
Joining <(1)(2 3)(4)> and <(2 3)(4)(5)> produces <(1) (2 3)(4)(5)>
Joining <(1)(2 3)(4)> and <(2 3)(4 5)> produces <(1)(2 3)(4 5)>
17
Generating Ck from Lk-1 (k>2) (Cont’d)
Prune step: delete candidates in Ck that have infrequent (k-1)-subsequence
A (k-1)-subsequence of sequence s is derived by dropping an item from s
Check each (k-1)-subsequence against Lk-1
Example:
If <(ab)(d)>, <(b)(ad)>, <(b)(de)> are all length-3 frequent sequences, then
Join result: <(ab)(de)>
Its length-3 subsequences:
<(b)(de)>, <(a)(de)>, <(ab)(e)>, <(ab)(d)>
<(a)(de)> and <(ab)(e)> are infrequent, <(ab)(de)> is pruned As long as one of them is infrequent, can stop checking others.
No need to check <(ab)(d)> and <(b)(de)> (Why?)
C4 becomes empty
18
Generating Ck from Lk-1 (k>2) (Cont’d)
More Examples:
If <(a)(b)>, <(a)(a)> and <(b)(a)> are all length-2 sequential patterns, then length-3 candidates are:
Join result:
<(a)(b)(a)> <(a)(a)(b)>, <(a)(a)(a)>, <(b)(a)(b)>, and <(b)(a)(a)>.
After pruning:
<(a)(b)(a)>, <(a)(a)(b)>, <(a)(a)(a)>, <(b)(a)(a)>.
If <(bd)>, <(b)(b)> and <(d)(b)> are all length-2 sequential patterns, what are the length-3 candidates?
Join result:
<(bd)b>, <(b)(bd)>, <(b)(b)(b)>, <(d)(bd)>, <(d)(b)(b)>
After pruning:
<(bd)b>, <(b)(b)(b)>, <(d)(b)(b)>
19
Example Continued
L4: Length-4 sequential patterns: <(b)(c)(b)(a)>
<(bd)(b)(a)>
<(bd)(b)(c)>
<(bd)(c)(a)> <(bd)(c)(b)> <(bf)(b)(f)> <(d)(c)(b)(a)>
min_sup_count=2
Seq-id
Sequence
10
<(bd)(c)(b)(ac)>
20
<(bf)(ce)(b)(fg)>
30
<(ah)(bf)(a)(b)(f)>
40
<(be)(ce)(d)>
50
<(a)(bd)(b)(c)(b)(ade)>
C5: Length-5 candidates (after join and prune): <(bd)(c)(b)(a)>
L5: Length-5 sequential pattern (found by scanning DB):
<(bd)(c)(b)(a)>: 2
20
Summary of Simplified GSP
Make the first pass over the sequence database D to find all the 1-element (length-1) frequent sequences
Repeat until no new candidate or frequent sequences are found
Candidate Generation:
Merge joinable pairs of frequent sequences of length (k-
1) to generate candidate sequences that contain k items
Prune candidate k-sequences that contain infrequent (k-1)- subsequences
Support Counting and Candidate Elimination:
Make a new pass over the sequence database D to find the
support count for each candidate sequence
Eliminate candidate k-sequences whose actual support is less than min_sup 21
Bottlenecks of GSP
A huge set of candidates
1,000 frequent length-1 sequences generate
1000 ́1000+1000 ́999 =1,499,500 2
length-2 candidates! Multiple scans of database in mining
Real challenge: mining long sequential patterns An exponential number of short candidates
A length-100 sequential pattern needs 1030
candidate sequences!
100 æ100ö
åç ÷=2 -1»10
100 30 i=1è ø
çi÷
22
Better Method: PrefixSpan
Generate all frequent sequences without candidate generation and testing.
J. Pei et al. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. In Proc. of ICDE’01.
Will talk about it next.
23
PrefixSpan
Generate all frequent sequences without candidate generation and testing.
Strategy
Divide and conquer
Divide the patterns to be mined into subsets and find patterns in each subset recursively.
Projection-based
Recursively project a sequence database into a set of smaller databases based on the frequent “prefix” mined so far
Mine each projected database to find frequent “suffixes”
24
Prefix and Suffix
Suppose all the items in an element (i.e. an itemset) of a sequence are listed alphabetically.
Prefixes of sequence are ,
Suffixes of sequence :
<(abc)(ac)d(cf)> is the suffix wrt prefix <(_bc)(ac)d(cf)> is the suffix wrt prefix
25
Mining Sequential Patterns by Prefix Projections
Step 1: find length-1 sequential patterns , ,
Step 2: divide search space. The complete set of freq. seq. can be partitioned into 6 subsets:
The ones having prefix ; The ones having prefix ; …
The ones having prefix
SID
sequence
10
20
<(ad)c(bc)(ae)>
30
<(ef)(ab)(df)cb>
40
min_sup count=2
26
Finding Freq. Seq. with Prefix
Only need to consider sequences containing
In a sequence containing , only the sub- sequence (suffixes) prefixed with the first occurrence of should be considered.
The collection of such subsequences is called -projected database:
SID
sequence
10
20
<(ad)c(bc)(ae)>
30
<(ef)(ab)(df)cb>
40
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>.
min_sup count=2
Next: recursively mine -projected database to find frequent sequences in the projected database.
27
Finding Freq. Seq. with Prefix (Cont’d) Find local frequent length-1 sequences by scanning
-projected database
-projected database: <(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Local frequent length-1 sequences:
:2, :4, <(_b)>:2,
min_sup count=2
Generate all the length-2 freq. seq. having prefix :
Further partition frequent sequences with prefix into 6 subsets
Having prefix
Having prefix
28
Finding Freq. Seq. with Prefix
From -projected database:
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Generate
No local frequent items, stop growing prefix
min_sup count=2
SID
sequence
10
20
<(ad)c(bc)(ae)>
30
<(ef)(ab)(df)cb>
40
29
Finding Freq. Seq. with Prefix
From -projected database:
<(abc)(ac)d(cf)>, <(_d)c(bc)(ae)>, <(_b)(df)cb>, <(_f)cbc>
Generate
Local frequent length-1 sequences :2,
Generate length-3 freq. seq. with prefix
Further partition the set of feq. Seq. prefixed with
Sequences with prefix
Sequences with prefix
30
Finding Freq. Seq. with Prefix
From
Generate
No local frequent length-1 sequence, stop growing
31
Finding Freq. Seq. with Prefix
From
Generate
No local frequent length-1 sequence, stop growing
32
Finding Freq. Seq. with Prefix
From
Generate -projected database: <(ac)d(cf)>, <(ae)>
Local frequent length-1 sequences: :2
Generate length-4 freq. seq. with prefix : :2
Further mining -projected database returns no frequent sequence prefixed with .
This finishes generating all the patterns prefixed with
33
Completeness of PrefixSpan
SDB
SID
sequence
10
20
<(ad)c(bc)(ae)>
30
<(ef)(ab)(df)cb>
40
Having prefix
Having prefix
Length-2 sequential Patterns starting with a:
Length-1 sequential patterns , ,
Having prefix
……
-projected database <(abc)(ac)d(cf)> <(_d)c(bc)(ae)> <(_b)(df)cb> <(_f)cbc>
-projected database
Having prefix
Having prefix
34
Efficiency of PrefixSpan
No candidate sequence needs to be generated
Projected databases keep shrinking
Major cost of PrefixSpan: constructing projected databases.
can be improved by Pseudo-projections
35
Speed-up by Pseudo-projection
Major cost of PrefixSpan: projection
Suffixes of a sequence often appear repeatedly in recursively projected databases
<(abc)(ac)d(cf)> appears in -projected database <(_c)(ac)d(cf)> appears in
When (projected) database can be held in
main memory, use pointers to form projections
Pointer to the sequence
s=
Offset of the suffix
s|: ( , 2) <(abc)(ac)d(cf)>
s|
Pseudo-Projection vs. Physical Projection
Pseudo-projection avoids physically copying suffixes
Efficient in running time and space when database can be held in main memory
However, it is not efficient when database cannot fit in main memory
Disk-based random accessing is very costly
Suggested Approach:
Integration of physical and pseudo-projection
Swapping to pseudo-projection when the projected database fits in memory
37
Experiments and Performance Analysis
Comparing PrefixSpan with GSP, FreeSpan and SPADE in large databases
GSP (IBM Almaden, Srikant & Agrawal EDBT’96)
FreeSpan (J. Han J. Pei, B. Mortazavi-Asi, Q. Chen, U. Dayal, M.C. Hsu, KDD’00)
SPADE (Zaki, 01)
PrefixSpan is fastest and scalable.
38
Problem of Sequential Pattern Mining
GSP and PrefixSpan finds all the sequences that satisfy the support threshold.
A long frequent sequence contains a combinatorial number of frequent subsequences:
For a length-100 sequential pattern, there exist 2100- 1 nonempty subsequences.
Problem: too many patterns are generated.
39
Solutions
Mining maximal or closed sequential patterns Maximal sequential pattern:
A frequent sequence s is maximal if there exists no frequent super-sequence of s.
Closed sequential pattern:
A sequence s is closed if there exists no super-sequence of
s with the same support as s.
CloSpan (Yan, Han and Afshar, 2003): an efficient closed sequential pattern mining method.
40
Solutions (Cont’d)
Constraint-based mining of sequential patterns
Constraints
Item constraint
Find patterns containing a, b, c, but no d.
Length constraint
Find patterns having at least (or at most) 10 items
Super-pattern constraint
Find patterns that contain <(PC)(Digital-camera)>
Aggregate constraint
Find patterns the average price of whose items is over $100.
41
Solutions (Cont’d)
More constraints
Regular expression constraint
Duration constraint
Find patterns of events about +/- 24 hours of a shooting
Gap constraint
Find purchasing patterns such that the gap between each consecutive purchases is less than 1 month.
42
Solutions (Cont’d)
Properties of constraints Anti-monotonic constraint
If a sequence s satisfies constraint C, so does every non-empty subsequence of s
Examples: support(s) >=5%, length(s)<10 Monotonic constraint
If a sequence s satisfies constraint C, so does every super sequence of s. Examples: length(s)>=10, super pattern constraints.
A systematic study on constraint-based sequential pattern ming:
J. Pei, J. Han, and W. Wang. “Mining Sequential Patterns with Constraints in Large Databases“. In Proceedings of CIKM’02.
43
More Recent Research
on Frequent Pattern Mining
Mining other kinds of frequent patterns
Frequent tree/graph mining
Find common structural components
Examples of applications
XML documents can be modeled as trees
Molecule or biochemical structures can be modeled as graphs Web connection structures can be modeled as graphs
Finding frequent patterns from data streams Only one scan of DB is allowed.
44
More Recent Research on Frequent Pattern Mining (Cont’d)
Finding high-utility patterns (either itemsets or sequences)
Consider the quantity of the item inside a transaction. Consider the importance (e.g., price) of an item
Find patterns whose revenue is at least, e.g., $1000
45
Frequent Pattern Mining Resources
Web site: http://fimi.cs.helsinki.fi/ contains some frequent itemset mining implementations, datasets and papers.
SPMF: open-source data mining library containing frequent sequence and itemset mining:
http://www.philippe-fournier-viger.com/spmf/
46
Next Class
Decision tree learning (Sections 8.1 and 8.2 in Chapter 8)
47