
Information Management
Data Mining
(from the book Data Mining: Concepts and Techniques)


Università degli Studi di Milano

Data Mining
• What is data mining?
It is a set of techniques and tools aimed at extracting interesting patterns from data
• Data mining is part of KDD (knowledge discovery in databases)
• KDD is “the process of identifying valid, novel, potentially useful, and ultimately
understandable patterns in data”
• Data warehouses are crucial for data mining purposes

Types of data mining
• Concept description
• Characterization: concise and succinct representation of a collection of data
• Comparison: description comparing two (or more) collections of data
• Descriptive data mining
• Describes concepts or task-relevant data sets in a concise and informative way
• Predictive data mining
• Builds a model based on data and their analysis; the model is then used to predict trends and properties of unknown data

Data mining branches (1)
• Aggregation queries are a very simple kind of mining
• Classification
• Build a model to categorize data in classes
• Regression
• Build a model to predict the result of a real-valued function
• Clustering
• Organize data into groups of similar items
• Outlier detection
• Identify unusual data items

Data mining branches (2)
• Trend analysis and forecasting
• Identify changes in patterns of data over time
• Detect dependencies among data
• Identify whether attributes are correlated with each other
• Identify which attributes likely occur together
• Temporal pattern detection (or time series mining)
• Identify common patterns in time series

Data mining: be careful! (1)
• Overfitting
• Identify spurious patterns: be careful not to take coincidence for causality!
• May be due to the analysis of too many attributes or of a limited number of data items
• Example: ask 10,000 subjects to predict the color of 10 face-down cards; 10 subjects predict all 10 cards correctly
Conclusion: 1 out of 1,000 subjects has extrasensory perception. NO!
• Report “obvious” results that do not derive from data analysis
• Example: women are more likely to have breast cancer

Data mining: be careful! (2)
• Confuse correlation and causation
• Data mining identifies correlated attributes, but this does not always imply a causal relationship!
• Example: overweight people are more likely to drink diet soda
Conclusion: diet soda causes obesity. NO!
• It is necessary to correctly interpret mining results
• Data mining algorithms are not magic
• Results must be carefully analyzed to avoid drawing wrong conclusions

Examples of application domains
• Market analysis
• Targeted marketing, customer profiling
• Determining patterns of purchases over time for suggestions
• Cross-market analysis
• Corporate analysis and risk management
• Finance planning and asset evaluation
• Resource planning
• Competition

Frequent itemset mining

Frequent patterns
• Frequent patterns: patterns (e.g., itemsets, subsequences, or substructures) that appear frequently in a data set
• Example: a set of items, such as milk and bread, that appear frequently together in a transaction data set is a frequent itemset
• Finding frequent patterns plays an essential role in mining associations, correlations, and many other interesting relationships among data
• Application examples: market basket analysis, cross-marketing, catalog design, loss-leader analysis, clustering, classification, etc.

• Let I={i1,…,in} be a collection of one or more data items
• Transaction T ⊆ I is a subset of items in I
• Itemset X ⊆ I is a subset of items
• k-Itemset X: itemset with k items
• Absolute support of X: frequency of occurrence of X (number of transactions that include X)
• Relative support: fraction of transactions that include X (probability that X appears in a transaction)
• Frequent itemset: itemset with support above a threshold
Items bought
• Bread, Milk
• Bread, Coffee, Milk
• Bread, Cocoa, Milk
• Cocoa, Coffee, Milk
• Bread, Cocoa, Coffee, Milk
[Figure: Venn diagram of customers buying milk, customers buying bread, and customers buying both]
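As a minimal Python illustration of these definitions on the transactions listed above (the set and variable names, and the 0.6 threshold, are illustrative, not from the slides):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Coffee", "Milk"},
    {"Bread", "Cocoa", "Milk"},
    {"Cocoa", "Coffee", "Milk"},
    {"Bread", "Cocoa", "Coffee", "Milk"},
]
X = {"Bread", "Milk"}
absolute_support = sum(1 for t in transactions if X <= t)   # 4 transactions include X
relative_support = absolute_support / len(transactions)     # 4/5 = 0.8
is_frequent = relative_support >= 0.6                       # True for an example threshold of 0.6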

Association rules
• Association rule A→B, where A and B are itemsets
• Support s:
fraction of transactions including both A and B
probability P(A ⋃ B)
{Bread, Milk}→{Coffee}
s = (# transactions containing {Bread, Coffee, Milk}) / (# transactions) = 2/5 = 0.4
c = (# transactions containing {Bread, Coffee, Milk}) / (# transactions containing {Bread, Milk}) = 2/4 = 0.5
Items bought
Bread, Milk
Bread, Coffee, Milk
Bread, Cocoa, Milk
Cocoa, Coffee, Milk
Bread, Cocoa, Coffee, Milk
• Confidence c:
fraction of transactions including A that also include B
probability P(B|A)
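A quick check of these numbers in Python (a minimal sketch on the example transactions; variable names are illustrative):

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Coffee", "Milk"},
    {"Bread", "Cocoa", "Milk"},
    {"Cocoa", "Coffee", "Milk"},
    {"Bread", "Cocoa", "Coffee", "Milk"},
]
A, B = {"Bread", "Milk"}, {"Coffee"}
n_A  = sum(1 for t in transactions if A <= t)         # transactions containing A: 4
n_AB = sum(1 for t in transactions if (A | B) <= t)   # transactions containing both A and B: 2
s = n_AB / len(transactions)   # support    = 2/5 = 0.4
c = n_AB / n_A                 # confidence = 2/4 = 0.5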

Items bought
Bread, Milk
Bread, Coffee, Milk
Bread, Cocoa, Milk
Cocoa, Coffee, Milk
Bread, Cocoa, Coffee, Milk
Strong rules
• Goal: given a set T of transactions, find strong association rules that have
• Support ≥ minsup threshold
• Confidence ≥ minconf threshold
• All the rules in the example are obtained by partitioning {Bread, Coffee, Milk} into antecedent and consequent
• Have the same support
• Have different confidence
{Bread, Milk}→{Coffee} s=0.4, c=0.5
{Bread, Coffee}→{Milk} s=0.4, c=1.0
{Milk, Coffee}→{Bread} s=0.4, c=1.0
{Bread}→{Coffee, Milk} s=0.4, c=0.5
{Coffee}→{Bread, Milk} s=0.4, c=0.67
{Milk}→{Bread, Coffee} s=0.4, c=0.5

Association rule mining
Association rule mining can operate in two steps
1. Find all frequent itemsets
• Determines the performance of the approach
2. Generate strong association rules from frequent itemsets
• Much less costly
Note: a frequent itemset of length 100 includes 100 frequent 1-itemsets, 100!/(2!·98!) = 4950 frequent 2-itemsets, and so on; overall, it includes 2^100 − 1 ≈ 1.27×10^30 frequent sub-itemsets!
Too many!!
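This count can be verified with a couple of lines of Python (only a sanity check of the figure above; math.comb is in the standard library):

import math
total = sum(math.comb(100, k) for k in range(1, 101))   # all non-empty sub-itemsets of a 100-itemset
print(total == 2**100 - 1, f"{total:.3e}")              # True 1.268e+30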

Itemset lattice
[Figure: the itemset lattice over items A–E, with all 2-itemsets (AB, AC, AD, AE, BC, BD, BE, CD, CE, DE), 3-itemsets (ABC … CDE), and 4-itemsets (ABCD … BCDE)]

Closed and maximal itemsets
• X is a closed itemset if there exists no proper super-itemset Y with the same absolute support (i.e., frequency) as X
• X is a closed frequent itemset if it is a closed itemset with frequency above minfreq
• X is a maximal frequent itemset (max-itemset) if it is frequent and there exists no super-itemset Y such that X ⊂ Y and Y is frequent
• X is frequent and is not included in another frequent itemset
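A small Python sketch of these two definitions, assuming a dictionary that maps every frequent itemset to its absolute support (so the output is the closed frequent and maximal frequent itemsets); the function and variable names are illustrative:

def closed_and_maximal(supports):
    # supports: {frozenset itemset -> absolute support}, containing only frequent itemsets
    itemsets = list(supports)
    closed, maximal = [], []
    for x in itemsets:
        supersets = [y for y in itemsets if x < y]                 # proper frequent super-itemsets of x
        if not any(supports[y] == supports[x] for y in supersets):
            closed.append(x)                                       # no superset with the same support
        if not supersets:
            maximal.append(x)                                      # no frequent superset at all
    return closed, maximal

# Example: transactions {a,b,c}, {a,b,c}, {a,b} with minimum support count 2
supports = {frozenset(s): c for s, c in
            [("a", 3), ("b", 3), ("c", 2), ("ab", 3), ("ac", 2), ("bc", 2), ("abc", 2)]}
closed, maximal = closed_and_maximal(supports)   # closed: {a,b} and {a,b,c}; maximal: {a,b,c}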

Apriori property
• Apriori property: all non-empty subsets of a frequent itemset must also be frequent
• If I is not frequent, adding i to I produces an itemset I ⋃ {i} that cannot appear more frequently than I
• Antimonotonicity: if a set does not pass a test, its supersets will also fail that test
[Figure: the itemset lattice again, showing that once an itemset is found not frequent, all of its supersets can be pruned]

Apriori algorithm (1)
• Scan all transactions and count the occurrences of each item (1-itemsets in candidate set C1)
• Select the 1-itemsets in C1 satisfying minsup threshold, and include them in L1
• Compute the set of candidate frequent 2-itemsets C2 as L1 ⋈ L1
• Scan all transactions to compute the absolute support of each 2-itemset in C2
• Select the 2-itemsets in C2 satisfying the minsup threshold, and include them in L2
• Compute the set of candidate frequent 3-itemsets C3 as L2 ⋈ L2
• …
• Stop when Ci is empty

Apriori algorithm (2)
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent 1-itemsets}
for (k = 2; Lk-1 ≠ ∅; k++) do begin
    Ck = candidates generated from Lk-1
    for each transaction t in database do
        increment the count of all candidates in Ck that are contained in t
    Lk = candidates in Ck with support ≥ minsup
end
return ⋃k Lk
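A minimal Python sketch of this loop, run on the five example transactions used earlier. The function name apriori and all variable names are illustrative, not from the slides; candidate generation here joins pairs of frequent (k−1)-itemsets and prunes with the Apriori property:

from itertools import combinations

def apriori(transactions, minsup):
    # Returns a dict {frozenset itemset -> absolute support} for all frequent itemsets
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    # L1: count 1-itemsets and keep those with relative support >= minsup
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {x: c for x, c in counts.items() if c / n >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join pairs of frequent (k-1)-itemsets,
        # then prune candidates that have an infrequent (k-1)-subset (Apriori property)
        prev = list(frequent)
        candidates = set()
        for i in range(len(prev)):
            for j in range(i + 1, len(prev)):
                union = prev[i] | prev[j]
                if len(union) == k and all(
                    frozenset(s) in frequent for s in combinations(union, k - 1)
                ):
                    candidates.add(union)
        # Support counting: one scan of the transactions per level
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in counts:
                if c <= t:
                    counts[c] += 1
        frequent = {c: cnt for c, cnt in counts.items() if cnt / n >= minsup}
        result.update(frequent)
        k += 1
    return result

# The five example transactions seen earlier, with minsup = 0.4:
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Coffee", "Milk"},
    {"Bread", "Cocoa", "Milk"},
    {"Cocoa", "Coffee", "Milk"},
    {"Bread", "Cocoa", "Coffee", "Milk"},
]
for itemset, count in sorted(apriori(transactions, 0.4).items(), key=lambda x: len(x[0])):
    print(set(itemset), count)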

Generate candidates
• To generate candidates Ck, compute a self-join between Lk-1 and Lk-1
• Join condition: the first k-2 items should be in common
• Items l1 and l2 in Lk-1 are joined if
• (l1[1]=l2[1]) ∧ (l1[2]=l2[2]) ∧ … ∧ (l1[k-2]=l2[k-2]) ∧ (l1[k-1] < l2[k-1])

Generating association rules
• confidence(A→B) = P(B|A) = support(A ⋃ B) / support(A)
• Association rules can be generated as follows:
1. For each frequent itemset l, generate all non-empty subsets of l
2. For every non-empty proper subset s of l, generate the rule s→(l−s) if support(l) / support(s) ≥ minconf
Note: minimum support is guaranteed by the fact that l is a frequent itemset
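The rule-generation step can be sketched in the same style; this assumes the dictionary of frequent itemsets with their absolute supports produced by the apriori sketch above (names are illustrative):

from itertools import combinations

def generate_rules(frequent, minconf):
    # frequent: {frozenset itemset -> absolute support}, as returned by apriori()
    rules = []
    for l, count in frequent.items():
        if len(l) < 2:
            continue
        for r in range(1, len(l)):                       # all non-empty proper subsets s of l
            for s in map(frozenset, combinations(l, r)):
                conf = count / frequent[s]               # support(l) / support(s); s is frequent by the Apriori property
                if conf >= minconf:
                    rules.append((set(s), set(l - s), conf))
    return rules

# Example, continuing from the apriori() sketch above:
# for antecedent, consequent, conf in generate_rules(apriori(transactions, 0.4), 0.7):
#     print(antecedent, "->", consequent, round(conf, 2))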

Generating association rules: example
• Consider frequent itemset X={I1, I2, I5}
• Non-empty subsets of X are
• {I1,I2} generated rule {I1,I2}→{I5} with confidence 2/4
• {I1,I5} generated rule {I1,I5}→{I2} with confidence 2/2
• {I2,I5} generated rule {I2,I5}→{I1} with confidence 2/2
• {I1} generated rule {I1}→{I2,I5} with confidence 2/6
• {I2} generated rule {I2}→{I1,I5} with confidence 2/7
• {I5} generated rule {I5}→{I1,I2} with confidence 2/2
• minconf = 0.7: only the rules with confidence 2/2 are output
[Figure: the transaction table (TID, list of item IDs) from which these counts are computed; one visible row contains I1, I2, I3, I5]

Improving the Efficiency of Apriori
• Major computational challenges
• Multiple scans of transaction database
• Huge number of candidates
• Tedious workload of support counting for candidates
• Improving Apriori: general ideas
• Reduce passes of transaction database scans
• Shrink the number of candidates
• Facilitate support counting of candidates

Partition: Scan Database Only Twice
• Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
• Scan 1: partition the database and find local frequent patterns
• Scan 2: consolidate global frequent patterns
• A. Savasere, E. Omiecinski and S. Navathe, VLDB’95
DB = DB1 + DB2 + … + DBk
sup1(i) < σ|DB1|, sup2(i) < σ|DB2|, …, supk(i) < σ|DBk| ⟹ sup(i) < σ|DB|

DHP: Reduce the Number of Candidates
• A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
• Candidates: a, b, c, d, e
• Hash entries
• {ab, ad, ae}
• {bd, be, de}
• …
• Frequent 1-itemsets: a, b, d, e
• ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
• J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD’95
[Figure: hash table with buckets such as {ab, ad, ae}, {bd, be, de}, …, {yz, qs, wt}]

Sampling for Frequent Patterns
• Select a sample of the original database and mine frequent patterns within the sample using Apriori
• Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of frequent patterns are checked
• Example: check abcd instead of ab, ac, …, etc.
• Scan the database again to find missed frequent patterns
• H. Toivonen. Sampling large databases for association rules. VLDB’96

DIC: Reduce Number of Scans
• Once both A and D are determined frequent, the counting of AD begins
• Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
• S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. SIGMOD’97
[Figure: itemset lattice (AB, AC, BC, AD, BD, CD, …) and a timeline of transactions showing when the counting of 1-itemsets and 2-itemsets starts]

FP-tree (1)
1. Scan the DB (similarly to Apriori) to find frequent 1-itemsets
2. Sort frequent items in frequency descending order (f-list)
3. Scan the DB again and construct the FP-tree
f-list: f, c, a, m, b, p

Transactional database (list of item IDs) with the corresponding ordered frequent items:
f, a, c, d, g, i, m, p → f, c, a, m, p
a, b, c, f, l, m, o → f, c, a, b, m
b, f, h, j, o, w → f, b
b, c, k, s, p → c, b, p
a, f, c, e, l, p, m, n → f, c, a, m, p

FP-tree (2)
• Root: empty node {}
• For each transaction, order its items according to the f-list
• Each transaction is a path in the FP-tree
• Build a path for each transaction, increasing by one the counter on common sub-paths
[Figure: the resulting FP-tree (node counts such as c:3, a:3, m:2, p:2, b:1, …) together with a header table of support counts]

FP-tree: example (1)–(6)
[Figures: step-by-step construction of the FP-tree, inserting one ordered transaction of the table above at a time until the complete tree is obtained]
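A compact Python sketch of steps 1–3 (find frequent items, sort by the f-list, insert each ordered transaction as a path that shares counters on common prefixes); the class, function, and variable names are illustrative assumptions, not from the slides:

from collections import Counter

class FPNode:
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}                      # item -> child FPNode

def build_fptree(transactions, min_count):
    # Pass 1: frequent 1-itemsets and f-list (descending frequency)
    freq = Counter(item for t in transactions for item in set(t))
    freq = {i: c for i, c in freq.items() if c >= min_count}
    f_list = sorted(freq, key=lambda i: -freq[i])
    rank = {item: r for r, item in enumerate(f_list)}
    # Pass 2: insert each transaction, keeping only frequent items in f-list order
    root = FPNode(None)
    header = {item: [] for item in f_list}      # node links, one list per frequent item
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in rank), key=lambda i: rank[i])
        node = root
        for item in ordered:
            if item not in node.children:
                child = FPNode(item, parent=node)
                node.children[item] = child
                header[item].append(child)
            node = node.children[item]
            node.count += 1                     # increase the counter on the shared sub-path
    return root, header, f_list

# The transactional database above, with minimum support count 3:
db = [list("facdgimp"), list("abcflmo"), list("bfhjow"), list("bcksp"), list("afcelpmn")]
root, header, f_list = build_fptree(db, min_count=3)
print(f_list)   # e.g. ['f', 'c', 'a', 'b', 'm', 'p'] (order among equal counts may differ from the slide's f-list)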
FP-tree: pattern base
• The problem of mining frequent patterns in databases is transformed into that of mining the FP-tree
• Extract the conditional pattern base of each item by visiting the tree
• Conditional pattern bases (prefix paths with their counts), e.g.:
• m: fca:2, fcab:1
• b: fca:1, f:1, c:1
• p: fcam:2, cb:1
[Figure: the FP-tree with its header table of support counts, annotated with the conditional pattern bases]

Conditional FP-tree
• For each pattern base
• Accumulate the count for each item in the base
• Construct the FP-tree for the frequent items of the pattern base
• Generate frequent patterns
• Example (item m): the pattern base {fca:2, fcab:1} gives accumulated counts f:3, c:3, a:3 (b has support count below the threshold); the resulting conditional FP-tree generates frequent patterns such as fm, fcm, fcam

FP-growth
• FP-growth transforms the problem of finding long frequent patterns into searching for shorter ones recursively and then concatenating the suffix
• It uses the least frequent suffix, offering good selectivity
• It reduces the search cost
• If the tree does not fit into main memory, partition the database
• Efficient and scalable for mining both long and short frequent patterns

Which patterns are interesting?
• Minimum support and confidence thresholds help to exclude uninteresting rules
• When mining with low support thresholds or long patterns, we may end up with uninteresting rules
• Is a rule interesting?
• Subjective measure: only users can ultimately say if a rule is interesting
• Objective measure: may help filter uninteresting rules based on the statistics behind the data

Misleading association rules
• Consider an electronics shop and, in particular, transactions including computer games and videos
• Out of 10,000 transactions
• 6,000 include games
• 7,500 include videos
• 4,000 include both games and videos
• Rule {games}→{video} has support=0.4 and confidence=0.66
• It is considered strong if the threshold is fixed to 0.3
• Misleading!! Computer games and videos are negatively associated: the purchase of one decreases the likelihood of purchasing the other

Correlation analysis
• The confidence of a rule A→B can be deceiving
• It is an estimate of the conditional probability of itemset B given itemset A
• It does not measure the real strength of the correlation/implication between A and B
• Need to adopt correlation analysis
• There exist different correlation measures
• Lift
• Chi-square
• …

Lift
• The occurrence of A is independent of the occurrence of B if P(A ⋃ B) = P(A)P(B); otherwise A and B are dependent and correlated as events
• lift(A,B) = P(A ⋃ B) / (P(A)P(B))
• lift(A,B) < 1: A is negatively correlated with B (the occurrence of A likely leads to the absence of B)
• lift(A,B) > 1: A is positively correlated with B (the occurrence of A likely leads to the occurrence of B)
• lift(A,B) = 1: A and B are independent
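A one-function sketch of this measure, taking the relative supports P(A), P(B), and P(A ⋃ B) as inputs (names are illustrative; the numbers are those of the example on the next slide):

def lift(p_a, p_b, p_ab):
    # p_ab is the fraction of transactions containing both A and B
    return p_ab / (p_a * p_b)

print(lift(0.60, 0.75, 0.40))   # ≈ 0.89 < 1: games and videos are negatively correlated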

Lift: example
• Out of 10,000 transactions
• 6,000 include games
• 7,500 include videos
• 4,000 include both games and videos
• Rule {games}→{video}
• support=0.4
• confidence=0.66
• lift = P({game, video}) / (P({game})·P({video})) = 0.4 / (0.6 · 0.75) = 0.89 < 1

Classification

What is classification?
• Examples:
• Analyze data to learn which loan applicants are risky and which are safe
• Analyze breast cancer data to predict which of the three specific treatments is better
• The classification task is aimed at predicting the class (category or label) of data
• It is a prediction problem
• Classes are discrete values that are not characterized by an order relationship

Classification: a two-step process
Learning (training) step
• Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
• The set of tuples used for model construction is the training set
• The model is represented as classification rules, decision trees, or mathematical formulae
Classification step
• Estimate the accuracy of the model
• The known label of each test sample is compared with the classified result from the model
• The accuracy rate is the percentage of test set samples that are correctly classified by the model
• The test set is independent of the training set (otherwise overfitting)
• If the accuracy is acceptable, use the model to classify new data

Learning step: example
[Figure: training data (name, rank such as Assistant Prof, years, tenured) are fed to a classification algorithm, which produces a classifier (model) expressed as the rule below]
IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Classification step: example
[Figure: test data (tuples with ranks such as Assistant Prof and Associate Prof) are fed to the classifier to estimate its accuracy; the model is then applied to unseen data, e.g., (Jeff, Professor, 4), for which the rule predicts tenured = ‘yes’]
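The two-step process can be sketched with scikit-learn (an assumed external library, not part of the slides): learn a decision-tree classifier on a training set, then estimate its accuracy on an independent test set before applying it to unseen data.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Labelled tuples: features plus a discrete class label (here, the bundled breast cancer data set)
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)   # learning (training) step
accuracy = accuracy_score(y_test, model.predict(X_test))            # classification step: accuracy on the test set
print(f"test accuracy: {accuracy:.2f}")
# If the accuracy is acceptable, model.predict(...) can be applied to new, unseen data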
