Data vs Information
Data Mining & Machine Learning
Lecture 2
Data Mining Basics
Discuss a framework for Knowledge Discovery
Examine some evaluation methods for classification
Discuss methods for pre-processing data
Session Goals
Data Mining is part of a larger iterative process of Knowledge Discovery
The steps in the Knowledge Discovery process are:
Defining the problem – identify the goals of your KD project
Data Collection – will involve cleaning and pre-processing the data
Data Mining – the model building step
Validating the model – will usually involve some type of Statistical Analysis
Deploying the model
Monitoring the model – the model requires periodic re-evaluation on new data to assess whether it is still appropriate; this might involve retraining
A Framework for Knowledge Discovery
Data Quality
What kinds of data quality problems?
How can we detect problems with the data?
What can we do about these problems?
Examples of data quality problems:
Noise and outliers
missing values
duplicate data
Noise
Noise refers to the modification of original values
Examples: distortion of a person’s voice over a poor phone connection, or “snow” on a television screen
[Figure: two sine waves, before and after adding random noise]
Data Preprocessing
Aggregation
Missing values
Data errors
Outliers
Discretization
Data Normalization
Data Balancing
Feature selection
Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)
Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc
More “stable” data
Aggregated data tends to have less variability
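The reduced-variability effect can be sketched with made-up monthly precipitation figures (the numbers below are purely illustrative, not the Australian data referred to later):

```python
import statistics

# Hypothetical monthly precipitation (mm) for one location over 3 years.
monthly = [110, 95, 80, 60, 40, 25, 20, 30, 50, 70, 90, 105,
           120, 90, 85, 55, 45, 20, 15, 35, 55, 65, 95, 100,
           105, 100, 75, 65, 35, 30, 25, 25, 45, 75, 85, 110]

# Aggregate: average the 12 monthly values of each year.
yearly = [statistics.mean(monthly[i:i + 12])
          for i in range(0, len(monthly), 12)]

print(statistics.stdev(monthly))  # large: raw monthly values vary a lot
print(statistics.stdev(yearly))   # much smaller after aggregation
```

The yearly averages smooth out the seasonal swings, so their standard deviation is far lower than that of the raw monthly values.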
Aggregation
[Figures: standard deviation of average monthly vs. average yearly precipitation in Australia – the aggregated (yearly) values show less variability]
Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)
Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their probabilities)
id   mpg    cylinders   cubic inches   hp
 1   14         8           350        165
 2   31.9       4            89         71
 3   51.7       8           302        140
 4   15         ?           400        150
 5   30.5       ?             ?          ?
 6   23         ?           350        125
 7   13         ?           351        158
 8   14         8             ?        215
 9   25.4       5             ?         77
10   37.7       4            89         62
How to handle missing values?
id   mpg    cylinders   cubic inches   hp
 1   14         8           350        165
 2   31.9       4            89         71
 3   51.7       8           302        140
 4   15         6           400        150
 5   30.5       6           275.86     129.22
 6   23         6           350        125
 7   13         6           351        158
 8   14         8           275.86     215
 9   25.4       5           275.86      77
10   37.7       4            89         62
How to handle missing values?
Replace missing values with mean values
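A minimal sketch of mean imputation in plain Python on the car data above (`None` marks a missing value; column names abbreviate the table headers):

```python
import statistics

# The car data from the table above; None marks a missing value.
rows = [
    {"id": 1,  "mpg": 14,   "cyl": 8,    "cubic_in": 350,  "hp": 165},
    {"id": 2,  "mpg": 31.9, "cyl": 4,    "cubic_in": 89,   "hp": 71},
    {"id": 3,  "mpg": 51.7, "cyl": 8,    "cubic_in": 302,  "hp": 140},
    {"id": 4,  "mpg": 15,   "cyl": None, "cubic_in": 400,  "hp": 150},
    {"id": 5,  "mpg": 30.5, "cyl": None, "cubic_in": None, "hp": None},
    {"id": 6,  "mpg": 23,   "cyl": None, "cubic_in": 350,  "hp": 125},
    {"id": 7,  "mpg": 13,   "cyl": None, "cubic_in": 351,  "hp": 158},
    {"id": 8,  "mpg": 14,   "cyl": 8,    "cubic_in": None, "hp": 215},
    {"id": 9,  "mpg": 25.4, "cyl": 5,    "cubic_in": None, "hp": 77},
    {"id": 10, "mpg": 37.7, "cyl": 4,    "cubic_in": 89,   "hp": 62},
]

def impute_mean(rows, column):
    """Replace each missing (None) entry in `column` with the
    mean of the observed values in that column."""
    mean = statistics.mean(r[column] for r in rows if r[column] is not None)
    for r in rows:
        if r[column] is None:
            r[column] = round(mean, 2)
    return rows

for col in ("cyl", "cubic_in", "hp"):
    impute_mean(rows, col)
print(rows[4])  # id 5 gains cyl 6.17, cubic_in 275.86, hp 129.22
```

This reproduces the 275.86 and 129.22 values in the imputed table; the table rounds the cylinder mean (6.17) to the whole number 6.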
Duplicate Data
Data set may include data objects that are duplicates, or almost duplicates of one another
Major issue when merging data from heterogeneous sources
Examples:
Same person with multiple email addresses
Data cleaning
Process of dealing with duplicate data issues
What are the problems with the raw data set below?
Data Errors
Customer ID   Zip Code   Gender   Salary       Age   Marital Status   Trans. Amount
1001          10048      M        75,000       C     M                5,000
1002          J2S7K7     F        -40,000      40    W                3,000
1003          90210               10,000,000   45    S                6,000
1004          6269       M        50,000       0     S                5,000
1005          55101      F        99,999       30    D                10,000
Extreme values that lie close to the limits of the data range or do not follow the trend of the remaining data
May represent errors that occurred during data acquisition
May have a negative impact on the data mining method
Outliers
Problem:
The ranges of different variables may differ greatly, which can have a negative effect on the data mining technique: variables with greater ranges have a stronger impact on the results than others.
Data Normalization
id   mpg    cylinders   cubic inches   hp
1    14         8           350        165
2    31.9       4            89         71
3    51.7       8           302        140
4    15         8           400        150
5    30.5       4           144        116.55
6    23         4           350        125
Solution:
Normalize the data to standardize the scale of each variable.
Example:
Min-Max Normalization
Data Normalization
id mpg* cylinders* cubic inches* hp*
1 0 1 0.84 1
2 0.47 0 0 0
3 1 1 0.68 0.73
4 0.03 1 1 0.84
5 0.44 0 0.18 0.48
6 0.24 0 0.84 0.57
Other techniques:
Decimal Scaling
Z-Score normalization
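A minimal min-max sketch in plain Python; applied to the mpg column of the table above, it reproduces the mpg* values shown:

```python
def min_max_normalize(values):
    """Scale a list of numbers to [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

# The mpg column from the normalization example.
mpg = [14, 31.9, 51.7, 15, 30.5, 23]
print([round(v, 2) for v in min_max_normalize(mpg)])
# [0.0, 0.47, 1.0, 0.03, 0.44, 0.24]
```

Note that min-max normalization divides by zero if all values in a column are equal; real code should guard against that case.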
Often real world data is imbalanced
For a credit card transaction stream, 99% of transactions may be genuine while only 1% are fraudulent
In such cases most machine learning algorithms will have difficulty learning patterns that characterise the minority “fraud” class
Performance can often be improved by either scaling down the majority (“genuine”) class or creating new examples for the minority (“fraud”) class
Data Balancing
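The first option, random undersampling of the majority class, can be sketched as follows (plain Python; the transaction records are dummies for illustration):

```python
import random

def undersample(rows, label_key, seed=0):
    """Randomly down-sample every class to the size of the
    smallest (minority) class."""
    by_class = {}
    for r in rows:
        by_class.setdefault(r[label_key], []).append(r)
    n_min = min(len(members) for members in by_class.values())
    rng = random.Random(seed)  # fixed seed for reproducibility
    balanced = []
    for members in by_class.values():
        balanced.extend(rng.sample(members, n_min))
    return balanced

# 99 genuine transactions vs 1 fraud -> 1 of each after undersampling.
data = [{"label": "genuine"}] * 99 + [{"label": "fraud"}]
print(len(undersample(data, "label")))  # 2
```

The alternative (creating new minority-class examples) ranges from simple duplication to synthetic generation techniques such as SMOTE.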
Motivation: Removal of redundant attributes has the following advantages:
Results in a smaller, and more understandable model
Reduces the amount of time spent in model building – increases efficiency
Can improve classification accuracy
Feature Selection
In general there are two different approaches:
Forward selection
Backward Elimination
General Approaches
[Figure: the lattice of attribute subsets for the weather data – outlook (O), temp (T), humidity (H), windy (W); their pairs (O,T / O,H / O,W / T,H / T,W / H,W), triples, and the full set O,T,H,W]
In forward selection we start with an empty set A of attributes
At each step an attribute is tried as part of A and a performance measure is evaluated (for example Classification Accuracy)
The attribute that produces the best performance becomes part of set A.
We now try each of the remaining attributes as part of set A and choose the one with the highest Accuracy to form part of A.
The entire process is repeated until no more attributes can be added to set A – i.e. at a particular round (iteration) every remaining attribute, when added, decreases rather than increases the Accuracy
The set A at the end of the process contains the set of non-redundant attributes
Forward Selection
Note that an attribute that was discarded in an iteration is put back in the set for contention. Why?
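The steps above can be sketched as follows; `evaluate` stands in for whatever performance measure is used (e.g. classification accuracy on a validation set) and is assumed to be supplied by the caller:

```python
def forward_selection(attributes, evaluate):
    """Greedy forward selection: grow set A one attribute at a time,
    keeping the addition that most improves the performance measure."""
    selected = []
    best_score = evaluate(selected)
    while True:
        remaining = [a for a in attributes if a not in selected]
        if not remaining:
            break
        # Every attribute not yet in A is back in contention each round.
        score, attr = max((evaluate(selected + [a]), a) for a in remaining)
        if score <= best_score:   # no addition improves performance: stop
            break
        selected.append(attr)
        best_score = score
    return selected
```

With a toy scoring function that rewards, say, outlook and humidity, the loop would add exactly those two attributes and then stop, since every further addition lowers the score.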
Similar to forward selection but the set A initially consists of the full set of attributes.
At each step we eliminate (rather than add) the attribute whose removal causes the smallest loss of Information Gain or the least change in accuracy.
The process is repeated until we reach an iteration where removing any attribute remaining in set A would lead to an unacceptable loss of Information Gain or accuracy.
The attributes that remain in set A form the list of non-redundant attributes.
Backward Elimination
Data Mining Tasks
Example problems: weather data, contact lenses, irises, labor negotiations
Classification learning is supervised
Scheme is being provided with actual outcome
Outcome is called the class of the example
Success can be measured on fresh data for which class labels are known (test data)
Classification Learning
The confusion matrix is a good summary of a classifier’s performance
                Computed Accept   Computed Reject
Actual Accept   True Accept       False Reject
Actual Reject   False Accept      True Reject
Evaluating Success in Classification
The matrix provides us with a good basis for comparing classifiers
For example, Models A and B represent two different classifiers for the credit card application problem
Which model is better, A or B ?
Model A         Computed Accept   Computed Reject
Actual Accept         600                25
Actual Reject          75               300

Model B         Computed Accept   Computed Reject
Actual Accept         600                75
Actual Reject          25               300
Evaluating Success in Classification
Compare fraud detection vs. cancer detection
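Both models classify 900 of the 1000 examples correctly, so accuracy alone cannot separate them; precision and recall (treating “Accept” as the positive class) show how they differ. A sketch:

```python
def metrics(tp, fn, fp, tn):
    """Accuracy, precision and recall from confusion-matrix counts,
    with 'Accept' treated as the positive class."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return accuracy, precision, recall

a = metrics(600, 25, 75, 300)   # Model A
b = metrics(600, 75, 25, 300)   # Model B
print([round(x, 3) for x in a])  # accuracy 0.9, precision ~0.889, recall 0.96
print([round(x, 3) for x in b])  # accuracy 0.9, precision 0.96, recall ~0.889
```

Which trade-off is better depends on the domain: in cancer detection a missed positive (low recall) is far more costly than a false alarm, while in other settings false accepts dominate the cost.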
Given a set of records, each of which contains some number of items from a given collection;
Produce dependency rules which will predict occurrence of an item based on occurrences of other items.
Association Rule Discovery: Definition
Rules Discovered:
{Milk} -> {Coke}
{Diaper, Milk} -> {Beer}
Marketing and Sales Promotion:
Let the rule discovered be
{Bagels, … } -> {Potato Chips}
Potato Chips as consequent => Can be used to determine what should be done to boost its sales.
Bagels in the antecedent => Can be used to see which products would be affected if the store discontinues selling bagels.
Bagels in antecedent and Potato chips in consequent => Can be used to see what products should be sold with Bagels to promote sale of Potato chips!
Association Rule Discovery
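Support and confidence for rules such as {Diaper, Milk} -> {Beer} can be computed directly from the five-transaction basket data used in this example (a sketch):

```python
# The five market-basket transactions from the example.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction
    that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

print(support({"Diaper", "Milk", "Beer"}))       # 0.4 (2 of 5 baskets)
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 2/3 ~ 0.667
```

{Diaper, Milk} appears in three baskets and {Diaper, Milk, Beer} in two, giving the rule support 2/5 and confidence 2/3.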
Clustering
Finding groups of items that are similar
Clustering is unsupervised
The class of an example is not known
Success of clustering often measured subjectively
Example problem: iris data without class
Numeric Prediction
Like classification learning but with numeric “class”
Learning is supervised
Scheme is being provided with target value
Success is measured on test data
Example: modified version of weather data
Models the relationship between variable Y and variable X using a linear fit: Y = a X + b
Determines the best straight line through the given data
Example: Linear Regression
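A sketch of the least-squares computation for Y = aX + b (the sample points below are made up and lie exactly on y = 2x + 1):

```python
def linear_fit(xs, ys):
    """Least-squares fit of Y = a*X + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of X and Y over variance of X.
    a = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - a * mean_x   # the line passes through (mean_x, mean_y)
    return a, b

a, b = linear_fit([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # 2.0 1.0
```

On noisy data the fit minimises the sum of squared vertical distances between the points and the line.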
Min-Max Normalization formula:
X* = (X - min(X)) / (max(X) - min(X))
Market-basket transactions used in the association rule example:

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk