CS699 Lecture 1 Introduction
• Data mining is an important component of data analysis.
• Our focus is “data mining” not “data warehousing.”
• Will discuss
– Data preprocessing
– Basic data mining algorithms
– How to evaluate data mining models and data mining results
– How to perform data mining using software tools
• A good data mining web site: kdnuggets.com
• A good dataset site: UCI Machine Learning Repository
• Prerequisites:
– CS546 and either CS669 or CS579, or instructor’s consent.
• Math requirements
– Math is a tool to describe algorithms
– Mostly basic algebra (not linear algebra) and basic probabilities and statistics
– A little bit of calculus
– You will have to do calculations using a calculator (which has a “log” function)
• You will practice data mining with Weka and JMP Pro
• Weka:
– Free
– Easy to learn and easy to use
– Has a large number of data mining algorithms
– You will use it almost immediately
– Also used for class project
• JMP Pro: A statistical analysis software with some data mining algorithms implemented in it
• Freely available from BU’s IT website (refer to homework 1)
– You will use primarily Weka
• Class project:
– Building and testing classifier models using a real‐world dataset
– You may use any other tools, including R, Python, or JMP Pro for data preprocessing
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e‐commerce, transactions, stocks, …
• Science: Remote sensing, bioinformatics, scientific simulation, …
• Society and everyone: news, digital cameras, YouTube, social network
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
What Is Data Mining?
• Data mining (knowledge discovery from data)
– Extraction of interesting (non‐trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”?
– Simple search and query processing
– (Deductive) expert systems
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
[Figure: KDD process. Databases → Data Cleaning / Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
Data Mining in Business Intelligence
[Figure: pyramid of layers, listed here from top to bottom; the potential to support business decisions increases toward the top]
• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems
Roles shown beside the pyramid, top to bottom: End User, Business Analyst, Data Analyst, DBA
A Typical View from ML and Statistics
• This is a view from typical machine learning and statistics communities
[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]
• Data pre-processing: data integration, normalization, feature selection, dimension reduction
• Data mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization
What Kinds of Data?
• Database‐oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time‐series data, temporal data, sequence data (incl. bio‐sequences)
– Structure data, graphs, social networks and multi‐linked data
– Object‐relational databases
– Heterogeneous databases and legacy databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases
– The World‐Wide Web
Data Types
• Categorical (or nominal) vs. numeric data:

Categorical
OID  Age     Income  Buy?
1    Young   Low     Y
2    Young   High    Y
3    Old     Low     N
4    Middle  Low     Y
5    Middle  High    N
6    Old     Low     N
7    Young   High    N
8    Old     High    Y
9    Old     High    Y
10   Young   Low     N

Numeric
OID  Age  Height  Weight
1    15   60      180
2    8    48      115
3    32   72      153
4    27   65      145
5    17   58      189
6    56   70      150
7    72   56      163
8    22   63      172
9    42   71      139
10   39   68      150
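The two kinds of attributes support different operations. A minimal Python sketch, using the Age columns of the two tables on this slide (the variable names are illustrative, not part of the lecture):

```python
from collections import Counter
from statistics import mean

# Age column of the categorical table (OID 1..10).
age_categorical = ["Young", "Young", "Old", "Middle", "Middle",
                   "Old", "Young", "Old", "Old", "Young"]

# Age column of the numeric table (OID 1..10).
age_numeric = [15, 8, 32, 27, 17, 56, 72, 22, 42, 39]

# Categorical values can be counted and compared for equality only.
age_counts = Counter(age_categorical)

# Numeric values additionally support arithmetic, such as averaging.
age_mean = mean(age_numeric)

print(age_counts)
print(age_mean)
```

Averaging "Young" and "Old" has no meaning, which is why many algorithms treat the two attribute types differently.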
Classification
• Classification and label prediction
– Construct models (functions) based on some training examples, called the training dataset.
– Describe and distinguish classes or concepts for future prediction
• E.g., classify countries based on (climate), or classify cars based on (gas mileage)
– Predict some unknown class label (or class attribute)
• Typical methods
– Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule‐based classification, pattern‐based classification, logistic regression, …
• Typical applications:
– Credit card fraud detection, direct marketing, classifying stars, diseases, web‐pages, …
• Also called supervised learning
Classification
• Example (decision tree)
Classify a car with unknown class label (risk):
4‐door, 4‐cylinder, wagon. ==> risk = 1
Classification
• Music CD purchase dataset example
• A fictitious dataset, where 1’s and 0’s were entered arbitrarily.
• Contains information about customers’ purchases of music CDs collected over a certain period of time, say the past 12 months.
• A 1 in the dataset indicates that the customer purchased a CD of the musician at least once in the past 12 months.
• The class attribute indicates whether a customer is “young” or “old.”
Classification
• 12 attributes: 1 ID, 10 predictor (independent) attributes, 1 class (dependent) attribute
• 50 tuples
• A part of the dataset
Classification
• Decision tree generated by the J48 algorithm
Classification Example ASD Screening for Children
• Autistic Spectrum Disorder (ASD) Screening for Children
• Dataset downloaded from UCI ML Repository and slightly modified.
• Using behavioral features (captured through AQ‐10 Child Version) and individual characteristics, to decide whether an individual should pursue formal clinical diagnosis.
• 292 tuples and 18 attributes
Classification Example ASD Screening for Children
• Attributes
– Attributes 1 – 10: 10 questions of AQ‐10
– Attributes 11 – 17: individual characteristics
– Attribute 18: Class attribute; ASD Yes/No
Classification Example ASD Screening for Children
• J48 Decision Tree
– A model is built using the J48 decision tree algorithm (Weka)
– The model is tested using the 10‐fold cross‐validation method
– Classification accuracy: 81.7%
• Correctly classified 32 / 41 No tuples (78%)
• Correctly classified 53 / 63 Yes tuples (84.1%)
• Weighted average: 81.7%
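The reported figures can be checked by hand. A minimal Python sketch that recomputes the per‐class accuracies and their weighted average from the counts on this slide (the dictionary layout and function name are illustrative):

```python
# Per-class results from the slide: class label -> (correctly classified, total).
results = {"No": (32, 41), "Yes": (53, 63)}


def class_accuracy(correct, total):
    """Fraction of tuples of one class that were classified correctly."""
    return correct / total


total_tuples = sum(total for _, total in results.values())

# Weighted average: each class accuracy is weighted by its share of the tuples.
weighted_accuracy = sum(
    (total / total_tuples) * class_accuracy(correct, total)
    for correct, total in results.values()
)

for label, (correct, total) in results.items():
    print(label, round(100 * class_accuracy(correct, total), 1))
print("Weighted average", round(100 * weighted_accuracy, 1))
```

Weighting by class size makes the weighted average equal to the overall fraction of correctly classified tuples, here (32 + 53) / (41 + 63).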
Classification Example ASD Screening for Children
• Decision Tree
Classification vs. Numeric Prediction
• Classification:
– Predicted (dependent) attribute is a nominal attribute.
– Example: Predict whether a customer will buy a computer or not (yes or no, for example).
• Numeric prediction:
– Predicted (dependent) attribute is a numeric attribute.
– Example: Predict the weight (numeric value) of a person given the age and the height of the person.
– Example: CPU dataset
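To make numeric prediction concrete, here is a minimal Python sketch that fits a least‐squares line predicting weight from height. Two simplifications are assumptions of this sketch, not of the lecture: it uses height only (the slide's example uses age and height), and it takes the Height and Weight columns of the numeric table on the Data Types slide as training data.

```python
# Height and Weight columns of the numeric table (OID 1..10).
heights = [60, 48, 72, 65, 58, 70, 56, 63, 71, 68]
weights = [180, 115, 153, 145, 189, 150, 163, 172, 139, 150]


def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(heights, weights)


def predict_weight(height):
    """Predict a numeric value (weight), not a class label."""
    return intercept + slope * height


print(predict_weight(66))
```

The output is a number on a continuous scale, which is what separates numeric prediction from classification.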
Numeric Prediction Example
• Example: CPU dataset
• A part of the dataset
Association and Correlation Analysis
• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in a grocery store?
– Mine all frequent itemsets and then all strong rules.
– An itemset is frequent if its support is >= a predefined threshold, the minimum support
– A rule is written as: X => Y
– Example of a rule: {milk, butter} => {cheese, egg}
– A rule is strong if its confidence is >= a predefined threshold, the minimum confidence
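The definition of a frequent itemset can be sketched in a few lines of Python. The transactions below are hypothetical (not from the lecture), and the search is a brute‐force enumeration of all itemsets rather than the Apriori algorithm, so it is only suitable for tiny examples.

```python
from itertools import combinations

# Hypothetical grocery transactions, invented for illustration.
transactions = [
    {"milk", "butter", "cheese"},
    {"milk", "butter"},
    {"milk", "cheese", "egg"},
    {"butter", "cheese"},
    {"milk", "butter", "cheese", "egg"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Return {itemset (sorted tuple): support} for all itemsets with support >= min_support."""
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            s = support(set(combo), transactions)
            if s >= min_support:
                frequent[combo] = s
    return frequent


frequent = frequent_itemsets(transactions, min_support=0.6)
for itemset, s in frequent.items():
    print(itemset, s)
```

Here support is expressed as a fraction of all transactions; a count‐based threshold works the same way.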
Association and Correlation Analysis
• Example:
Support examples:
support of {bread} = 7
support of {egg, milk} = 4
support of {bread, egg, milk} = 3
A rule R: {bread} => {egg, milk}
Quality measures of the rule and informal interpretation:
Support(R) = 33.3% (or 3/9) /* fraction of people who purchased {bread, milk, egg} */
Confidence(R) = 42.9% (or 3/7) /* among those who purchased bread, fraction of people who also purchased {milk, egg} */
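The two measures can be recomputed in Python. The slide's transaction table did not survive, so the nine transactions below are a hypothetical reconstruction chosen only to reproduce the stated counts (7, 4, and 3); the function names are illustrative.

```python
# Hypothetical transactions matching the slide's counts:
# {bread} in 7, {egg, milk} in 4, {bread, egg, milk} in 3, out of 9.
transactions = [
    {"bread", "egg", "milk"},
    {"bread", "egg", "milk"},
    {"bread", "egg", "milk"},
    {"egg", "milk"},
    {"bread"},
    {"bread", "egg"},
    {"bread", "milk"},
    {"bread"},
    {"egg"},
]


def count(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions)


def rule_measures(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent => consequent."""
    both = count(antecedent | consequent, transactions)
    rule_support = both / len(transactions)
    rule_confidence = both / count(antecedent, transactions)
    return rule_support, rule_confidence


rule_support, rule_confidence = rule_measures({"bread"}, {"egg", "milk"}, transactions)
print(round(100 * rule_support, 1), round(100 * rule_confidence, 1))
```

Support divides by all transactions, while confidence divides only by those that contain the left‐hand side.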
Association and Correlation Analysis
• Music CD purchase dataset (preprocessed for Weka’s Apriori algorithm)
• A part of the dataset
Association and Correlation Analysis
• Some association rules (mined by the Apriori algorithm)
• Handel=t Mahler=t 5 ==> Bach=t 5
• Bach=t Haydn=t Mendelssohn=t 5 ==> Mozart=t 5
• Bach=t Haydn=t Mozart=t 5 ==> Mendelssohn=t 5
• Bach=t Mendelssohn=t 7 ==> Mozart=t 6
• Bach=t Handel=t 6 ==> Mahler=t 5
• Bach=t Mozart=t Mendelssohn=t 6 ==> Haydn=t 5
• Haydn=t Mendelssohn=t 9 ==> Mozart=t 7
• Haydn=t Mozart=t 9 ==> Mendelssohn=t 7
• Bach=t Mozart=t 8 ==> Mendelssohn=t 6
• Mahler=t 14 ==> Bach=t 10
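– In each rule, the number before ==> counts the tuples that match the left‐hand side, and the number after it counts those that also match the right‐hand side; confidence is their ratio, e.g., 7/9 gives conf:(0.78) for Haydn=t Mendelssohn=t 9 ==> Mozart=t 7.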
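The confidence of each listed rule follows from its two counts. A minimal Python sketch that parses rule lines in the format shown on this slide and divides the second count by the first (the regular expression and function name are illustrative):

```python
import re

rules = [
    "Handel=t Mahler=t 5 ==> Bach=t 5",
    "Bach=t Mendelssohn=t 7 ==> Mozart=t 6",
    "Haydn=t Mendelssohn=t 9 ==> Mozart=t 7",
    "Mahler=t 14 ==> Bach=t 10",
]

# "<left-hand side> <count> ==> <right-hand side> <count>"
RULE_PATTERN = re.compile(
    r"^(?P<lhs>.+?) (?P<lhs_count>\d+) ==> (?P<rhs>.+?) (?P<both_count>\d+)$"
)


def confidence(rule):
    """Confidence = tuples matching both sides / tuples matching the left-hand side."""
    match = RULE_PATTERN.match(rule)
    if match is None:
        raise ValueError(f"unrecognized rule format: {rule!r}")
    return int(match["both_count"]) / int(match["lhs_count"])


for rule in rules:
    print(rule, round(confidence(rule), 2))
```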
Association and Correlation Analysis
• Association, correlation vs. causality
– Are strongly associated items also strongly correlated?
– If two items are strongly correlated, is there a causal relationship?
• How to mine such patterns and rules efficiently in large datasets?
• Association rules can also be used for classification or clustering.
Cluster Analysis
• Unsupervised learning (i.e., there is no class label)
• Group data to form new categories (i.e., clusters), e.g., cluster customers into different groups
• Principle: Maximizing intra‐class similarity & minimizing inter‐class similarity
• Many methods and applications
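The clustering principle can be illustrated numerically. This Python sketch assumes Euclidean distance as the measure of dissimilarity (a small distance means high similarity) and uses two hypothetical clusters of 2‐D points; neither choice comes from the lecture.

```python
from itertools import combinations, product
from math import dist

# Hypothetical clusters of 2-D points, invented for illustration.
cluster_a = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]
cluster_b = [(8.0, 8.0), (8.5, 9.0), (9.0, 8.5)]


def mean_intra_distance(cluster):
    """Average distance between pairs of points inside one cluster."""
    pairs = list(combinations(cluster, 2))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


def mean_inter_distance(first, second):
    """Average distance between points taken from two different clusters."""
    pairs = list(product(first, second))
    return sum(dist(p, q) for p, q in pairs) / len(pairs)


print(mean_intra_distance(cluster_a))
print(mean_intra_distance(cluster_b))
print(mean_inter_distance(cluster_a, cluster_b))
```

A good clustering keeps the first two numbers small relative to the third.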
Cluster Analysis
• Clustering output types:
Cluster Analysis
• London cholera epidemic (Source: J. Leskovec, A. Rajaraman, and J.D. Ullman, “Mining of Massive Datasets,” 2014, page 3.)
Cluster Analysis
• Iris dataset (from UCI ML Repository)
• Used for classification
• Has 4 attributes and a class attribute
• Class attribute: type of iris plant
• A part of the dataset
Cluster Analysis
• A clustering algorithm was run on only two attributes
• Clustering result visualization
• X: petallength, Y: petalwidth
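The slide does not say which clustering algorithm was used. As one possibility, here is a minimal k‐means sketch in Python on two attributes. The six (petallength, petalwidth) points and the initial centroids are illustrative values chosen for this sketch, not the data behind the slide's plot.

```python
from math import dist

# Illustrative (petallength, petalwidth) points, not the actual plotted data.
points = [(1.4, 0.2), (1.3, 0.2), (1.5, 0.3),
          (4.7, 1.4), (4.5, 1.5), (5.1, 1.9)]


def k_means(points, centroids, max_iterations=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for i, cluster in enumerate(clusters):
            if cluster:
                new_centroids.append(tuple(sum(c) / len(cluster) for c in zip(*cluster)))
            else:
                new_centroids.append(centroids[i])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


# Initial centroids are picked by hand here; real implementations usually randomize.
centroids, clusters = k_means(points, [points[0], points[-1]])
print(centroids)
print(clusters)
```

No class label is used at any point, which is what makes this unsupervised.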
Outlier Analysis
– Outlier: A data object that does not comply with the general behavior of the data
– Noise or exception? ― One person’s garbage could be another person’s treasure
– Methods: byproduct of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
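A small illustration of flagging objects that deviate from the general behavior of the data. This Python sketch uses a z‐score rule, a simple statistical technique swapped in for illustration rather than one of the methods named on the slide; the transaction amounts and the threshold of 2.0 are hypothetical.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return the values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values are identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical transaction amounts; one is far from the rest.
amounts = [20, 22, 19, 21, 23, 20, 22, 21, 20, 500]
outliers = z_score_outliers(amounts)
print(outliers)
```

Whether the flagged value is noise to discard or the interesting exception (for example, a fraudulent transaction) depends on the application.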
Sequential Pattern, Trend and Evolution Analysis
– Trend, time‐series, and deviation analysis: e.g., regression and value prediction
– Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
– Periodicity analysis
– Biological sequence analysis
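To show what a sequential pattern means, this Python sketch measures how many purchase sequences contain "digital camera" followed, at any later point, by "SD memory card". The customer sequences are hypothetical, and the sketch only checks a given pattern; it does not mine patterns.

```python
def contains_subsequence(sequence, pattern):
    """True if the items of `pattern` appear in `sequence` in order (gaps allowed)."""
    remaining = iter(sequence)
    # `item in remaining` consumes the iterator up to the match, which enforces order.
    return all(item in remaining for item in pattern)


# Hypothetical purchase histories, one list per customer, in time order.
sequences = [
    ["digital camera", "SD memory card", "tripod"],
    ["SD memory card", "digital camera"],
    ["laptop", "digital camera", "camera bag", "SD memory card"],
    ["laptop", "mouse"],
]

pattern = ["digital camera", "SD memory card"]
pattern_support = sum(contains_subsequence(s, pattern) for s in sequences) / len(sequences)
print(pattern_support)
```

Unlike an itemset, the pattern is order‐sensitive: the second customer bought both items but in the opposite order, so that sequence does not count.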
Evaluation of Knowledge
• Is all mined knowledge interesting?
– One can mine a tremendous amount of “patterns” and knowledge
– Some may fit only certain dimension space (time, location, …)
– Some may not be representative, may be transient, …
• A pattern is interesting if it is
– easily understood
– valid on new data or test data with some degree of certainty
– potentially useful
– novel
• Objective measures (e.g., support and confidence of an association rule)
• Subjective measures (e.g., expected/unexpected, actionable)
Technologies Used in Data Mining
[Figure: Data Mining at the center, surrounded by the following fields]
• Machine Learning
• Pattern Recognition
• Statistics
• Visualization
• Applications
• Algorithm
• Database Technology
• High‐Performance Computing
Applications of Data Mining
• Web page analysis: from web page classification, clustering to PageRank & HITS algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
• Data mining and software engineering
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL‐Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Major Issues in Data Mining
• Mining Methodology
• User Interaction
• Efficiency and Scalability
• Diversity of data types
• Data mining and society
References
• Han, J., Kamber, M., Pei, J., “Data mining: concepts and techniques,” 3rd Ed., Morgan Kaufmann, 2012.
• http://www.cs.illinois.edu/~hanj/bk3/
• Shmueli, G., Bruce, P.C., Stephens, M.L., Patel, N.R., “Data mining for business analytics: concepts, techniques, and applications with JMP Pro,” Wiley, 2017.