Data vs Information
Data Mining & Machine Learning
Session 1
Course Overview and Introduction
Formulate a definition of Data Mining
Examine the different knowledge representation methods
Discuss a framework for Knowledge Discovery
Examine some landmark successes
Session Goals
We are living in the era of Big Data
Lack of data is no longer the problem; the major issue is our ability to keep pace with the arrival of new data
Potentially valuable resource
Raw data is useless on its own: we need techniques to automatically extract knowledge from it
Data: simply a set of recorded facts
Knowledge: insights into how systems behave
Data vs. Knowledge
There is often information “hidden” in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
Mining Large Data Sets – Motivation
[Figure: the "Data Gap" chart; one labelled series: Number of analysts]
Emerged in the mid-1990s and has grown rapidly worldwide.
Professionals are referred to by a number of different names: data scientists, information scientists, data analysts, etc.
Kaggle (www.kaggle.com) has more than 845,000 data scientists registered worldwide
DataMine (http://www.datamine.com), a local company with international branches, specializes in providing Data Mining consultancy services to clients
Many other companies have in-house Data Analysts
Data Mining (Science): a young discipline
Example 1: customer credit risk
Given: customers described by features such as income, debt level, employment history, etc.
Problem: selection of prospective customers that meet the good credit risk criteria
Data: historical records of customers and their loan payment details
Example 2: cow culling
Given: cows described by 700 features
Problem: selection of cows that should be culled
Data: historical records and farmers’ decisions
Information is crucial
Extraction of implicit, previously unknown, and potentially useful information from data
Needed: algorithms that automatically detect and extract patterns in the data
Strong patterns can be used to make predictions
Problem 1: most patterns are either trivial or are already known
Problem 2: patterns may be imprecise (or even completely spurious) if data is corrupted or missing
What exactly is Data Mining?
What is (not) Data Mining?
What is Data Mining?
Certain names are more prevalent in certain US locations (O’Brien, O’Rourke, O’Reilly… in the Boston area)
Group together similar documents returned by a search engine according to their context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
Look up people in phone directory
Query a Web search engine for information about “Amazon”
Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
Traditional techniques may be unsuitable due to:
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Origins of Data Mining
[Figure: Data Mining shown at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]
Technical basis for data mining: algorithms for automatically acquiring patterns from data
Patterns learnt can be used to predict the outcome in a new situation
They can also be used to understand and explain how a prediction is made (which may be even more important)
Methods originate from both artificial intelligence and statistics
Machine Learning Techniques
Given a collection of records (the training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model that maps the values of the other attributes to the class attribute
Goal: previously unseen records should be assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it.
Classification: Definition
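The train/test procedure described above can be sketched in a few lines of Python. This is only an illustration: it uses the ten labelled tax records from this session's classification example, holds out the last four as a test set, and uses a deliberately trivial "majority class" baseline as the model. The function names and the 6/4 split are assumptions made for the sketch, not part of the course material.

```python
from collections import Counter

# The ten labelled records from the classification example:
# (refund, marital_status, taxable_income_k, cheat)
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

def split(records, n_train):
    """Use the first n_train records for training and hold out the rest."""
    return records[:n_train], records[n_train:]

def learn_majority_class(training_set):
    """Baseline model: always predict the most common class seen in training."""
    majority = Counter(r[-1] for r in training_set).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test_set):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(model(r[:-1]) == r[-1] for r in test_set)
    return correct / len(test_set)

training_set, test_set = split(RECORDS, n_train=6)
model = learn_majority_class(training_set)
print(accuracy(model, test_set))
```

With this split the baseline gets two of the four held-out records right. A real learning algorithm would replace `learn_majority_class`; the split and the accuracy calculation stay the same.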
Classification Example
[Figure: a training set with two categorical attributes, one continuous attribute and a class attribute is used to learn a classifier (the model), which is then applied to a test set]
Taken from Introduction to Data Mining by Tan, Steinbach and Kumar, 2005
Direct Marketing
Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new model of iPhone
Approach:
Use the data for a similar product introduced before.
We know which customers decided to buy and which decided otherwise. This {buy, don’t buy} decision forms the class attribute.
Collect various demographic, lifestyle, and other related information about all such customers.
Age Category, Gender, how much they earn, etc.
Use this information as input attributes to learn a classifier model.
Classification: Further Applications
Fraud Detection
Goal: Predict fraudulent cases in credit card transactions.
Approach:
Use credit card transactions and the information on the account holder as attributes.
When a customer buys, what they buy, how often they pay on time, etc.
Label past transactions as fraud or honest transactions. This forms the class attribute.
Learn a model for the class of the transactions.
Use this model to detect fraud by observing credit card transactions on an account.
Classification: Further Applications
Customer Attrition/Churn:
Goal: To predict whether a customer is likely to be lost to a competitor.
Approach:
Use detailed records of transactions with each of the past and present customers to find attributes.
How often the customer calls, where they call, what time of day they call most, their financial status, marital status, etc.
Label the customers as loyal or disloyal.
Find a model for loyalty.
Classification: Further Applications
Sky Survey Cataloging
Goal: To predict class (star or galaxy) of sky objects, especially visually faint ones, based on the telescopic survey images (from Palomar Observatory).
3000 images with 23,040 x 23,040 pixels per image.
Approach:
Segment the image.
Measure image attributes (features) – 40 of them per object.
Model the class based on these features.
Success story: found 16 new high red-shift quasars, some of the farthest objects, which are difficult to find.
Classification: Further Applications
Classifying Galaxies
[Figure: galaxy images labelled Early, Intermediate and Late]
Class: stages of formation
Attributes: image features, characteristics of light waves received, etc.
Data size: 72 million stars, 20 million galaxies
Object catalog: 9 GB
Image database: 150 GB
Courtesy: http://aps.umn.edu
if-then rules
If tear production rate = reduced then recommendation = none
Otherwise, if age = young and astigmatic = no then recommendation = soft
Structural Descriptions
Taken from: Data Mining: Practical Machine Learning Tools and Techniques by Witten et al., 2011
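The two contact lens rules can be written directly as code, which makes the "otherwise" ordering explicit. This is a minimal sketch: the function name `recommend` and the choice to return `None` when neither rule fires are assumptions, since only these two rules are shown.

```python
def recommend(tear_production_rate, age, astigmatic):
    """Apply the two rules in order; return None when neither rule fires."""
    if tear_production_rate == "reduced":
        return "none"
    # Reached only when the first rule did not apply ("Otherwise, ...").
    if age == "young" and astigmatic == "no":
        return "soft"
    return None  # no further rules are given for the remaining cases

print(recommend("reduced", "young", "no"))  # first rule fires
print(recommend("normal", "young", "no"))   # second rule fires
```

Because the rules are tried in order, a young, non-astigmatic patient with a reduced tear production rate still gets "none": the first rule takes precedence.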
Conditions for playing a game
Another Example
If outlook = sunny and humidity = high then play = no
If outlook = rainy and windy = true then play = no
If outlook = overcast then play = yes
If humidity = normal then play = yes
If none of the above then play = yes
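The rules above form an ordered list with a default, and that structure can be kept as data rather than hard-coded. The sketch below is one possible encoding; the names `RULES`, `DEFAULT` and `play` and the dictionary representation of a day are illustrative assumptions.

```python
# Each rule is (conditions, conclusion). The first rule whose conditions all
# match decides the outcome, so the order of the list matters.
RULES = [
    ({"outlook": "sunny", "humidity": "high"}, "no"),
    ({"outlook": "rainy", "windy": True}, "no"),
    ({"outlook": "overcast"}, "yes"),
    ({"humidity": "normal"}, "yes"),
]
DEFAULT = "yes"  # "If none of the above then play = yes"

def play(day):
    """Return the conclusion of the first matching rule, else the default."""
    for conditions, conclusion in RULES:
        if all(day.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return DEFAULT

print(play({"outlook": "sunny", "humidity": "high", "windy": False}))
print(play({"outlook": "rainy", "humidity": "high", "windy": False}))
```

Keeping the rules as data means a learning algorithm could generate the list, and the same `play` function would interpret whatever rules it produced.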
The contact lenses data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Reduced               None
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Reduced               None
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Reduced               None
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Reduced               None
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Reduced               None
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Reduced               None
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Reduced               None
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Reduced               None
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Reduced               None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Reduced               None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Reduced               None
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Reduced               None
Presbyopic      Hypermetrope            Yes          Normal                None
The contact lenses data
Age             Spectacle prescription  Astigmatism  Tear production rate  Recommended lenses
Young           Myope                   No           Normal                Soft
Young           Myope                   Yes          Normal                Hard
Young           Hypermetrope            No           Normal                Soft
Young           Hypermetrope            Yes          Normal                Hard
Pre-presbyopic  Myope                   No           Normal                Soft
Pre-presbyopic  Myope                   Yes          Normal                Hard
Pre-presbyopic  Hypermetrope            No           Normal                Soft
Pre-presbyopic  Hypermetrope            Yes          Normal                None
Presbyopic      Myope                   No           Normal                None
Presbyopic      Myope                   Yes          Normal                Hard
Presbyopic      Hypermetrope            No           Normal                Soft
Presbyopic      Hypermetrope            Yes          Normal                None
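One way to see why the contact lens data appears twice, once in full and once restricted to a normal tear production rate, is to check the two if-then rules against the records. The sketch below rebuilds the 24 records from the full table and counts how many each rule covers and gets right. The helper name `rule_stats` and the compact encoding of the labels are assumptions made for brevity.

```python
from collections import Counter
from itertools import product

AGES = ["young", "pre-presbyopic", "presbyopic"]
PRESCRIPTIONS = ["myope", "hypermetrope"]
ASTIGMATISM = ["no", "yes"]
TEAR_RATES = ["reduced", "normal"]

# Recommended lenses for the 24 attribute combinations, in product() order
# (age varies slowest, tear production rate fastest).
LABELS = (
    "none soft none hard none soft none hard "   # young
    "none soft none hard none soft none none "   # pre-presbyopic
    "none none none hard none soft none none"    # presbyopic
).split()

DATA = [combo + (label,) for combo, label in
        zip(product(AGES, PRESCRIPTIONS, ASTIGMATISM, TEAR_RATES), LABELS)]

def rule_stats(records, condition, conclusion):
    """Return (covered, correct): records matching the condition, and how
    many of those actually carry the class the rule predicts."""
    covered = [r for r in records if condition(r)]
    correct = [r for r in covered if r[-1] == conclusion]
    return len(covered), len(correct)

# Rule 1: if tear production rate = reduced then recommendation = none
print(rule_stats(DATA, lambda r: r[3] == "reduced", "none"))

# Rule 2 is tried only on the records that rule 1 did not cover.
remaining = [r for r in DATA if r[3] != "reduced"]
print(rule_stats(remaining, lambda r: r[0] == "young" and r[2] == "no", "soft"))
print(Counter(r[-1] for r in remaining))
```

The first rule covers 12 records and is right on all of them, which leaves exactly the 12 normal-tear-rate records of the second table to be explained by further rules.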
Another Type of Structural Description: The Decision Tree
Data: 398 cars with information relating to fuel efficiency
Numeric Prediction: Linear Regression
car ID  cylinders  engine size  power output  weight  acceleration  mpg
1       8.0        307.0        130.0         3504.0  12.0          18.0
2       4.0        113.0        95.0          2372.0  15.0          24.0
3       8.0        318.0        150.0         3436.0  11.0          18.0
4       6.0        200.0        85.0          2587.0  16.0          21.0
5       8.0        302.0        140.0         3449.0  10.5          17.0
mpg = -0.3499 * cylinders - 0.0013 * engine size - 0.0394 * power output - 0.0054 * weight - 0.0154 * acceleration + 46.024
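To make the regression equation concrete, the sketch below applies it to the five cars listed in the table and prints the predicted mpg next to the recorded value. The coefficients and data are taken from the slide; the names `WEIGHTS`, `INTERCEPT`, `CARS` and `predict_mpg` are illustrative.

```python
# Coefficients in the order: cylinders, engine size, power output,
# weight, acceleration (as in the equation above).
WEIGHTS = [-0.3499, -0.0013, -0.0394, -0.0054, -0.0154]
INTERCEPT = 46.024

# (car ID, [attribute values in the same order], recorded mpg)
CARS = [
    (1, [8.0, 307.0, 130.0, 3504.0, 12.0], 18.0),
    (2, [4.0, 113.0, 95.0, 2372.0, 15.0], 24.0),
    (3, [8.0, 318.0, 150.0, 3436.0, 11.0], 18.0),
    (4, [6.0, 200.0, 85.0, 2587.0, 16.0], 21.0),
    (5, [8.0, 302.0, 140.0, 3449.0, 10.5], 17.0),
]

def predict_mpg(values):
    """Weighted sum of the attribute values plus the intercept."""
    return INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, values))

for car_id, values, actual in CARS:
    print(f"car {car_id}: predicted {predict_mpg(values):.2f}, actual {actual:.1f}")
```

For car 1 the equation gives roughly 18.6 mpg against a recorded 18.0. The model is a numeric prediction, so its output is compared by error size rather than by a right or wrong class.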
Can machines really learn?
Data from labour negotiations
Decision Trees for the labour data
Bias can arise in two different contexts:
Bias introduced by learning algorithm
Bias introduced by data
Every learning algorithm uses some form of heuristic to search for patterns
Heuristics are necessary because it is often infeasible to search the entire data space exhaustively for patterns
Data bias results from imperfect training datasets: some patterns that occur in the future are absent from the training data
Conversely, some patterns in training data may not manifest in future data
Recognising Bias in Machine Learning
Classification example data, training set:
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes
Classification example data, test set (class unknown):
Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?
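As a small illustration of applying a model to records whose class is unknown, the sketch below uses a hand-written rule set that happens to fit all ten labelled records and then applies it to the six unlabelled ones. The rule set, including the 80K income threshold, is a hypothetical model invented for this example. It is not the output of any particular learning algorithm and is not given in the slides.

```python
# Labelled records: (refund, marital_status, taxable_income_k, cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]

# Records whose class is unknown ("?" in the table)
UNLABELLED = [
    ("No", "Single", 75), ("Yes", "Married", 50), ("No", "Married", 150),
    ("Yes", "Divorced", 90), ("No", "Single", 40), ("No", "Married", 80),
]

def predict_cheat(refund, marital_status, income_k):
    """Hypothetical hand-written model, chosen only because it is
    consistent with the ten labelled records."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    return "Yes" if income_k >= 80 else "No"

# Check the model against the labelled data, then label the unknown records.
training_errors = sum(predict_cheat(*r[:3]) != r[3] for r in TRAINING)
predictions = [predict_cheat(*r) for r in UNLABELLED]
print(training_errors)
print(predictions)
```

The model makes no errors on the labelled records, but that says nothing certain about the unlabelled ones: a rule set that fits ten records perfectly may simply reflect the biases of that small sample.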