
Data Mining & Machine Learning
Lecture 2
Data Mining Basics


Session Goals

Discuss a framework for Knowledge Discovery
Examine some evaluation methods for Classification
Discuss methods for pre-processing data

A Framework for Knowledge Discovery

Data Mining is part of a larger iterative process of Knowledge Discovery
The steps in the Knowledge Discovery process are:
Defining the problem – identify the goals of your KD project
Data Collection – will involve cleaning and pre-processing the data
Data Mining – the model-building step
Validating the model – will usually involve some type of statistical analysis
Deploying the model
Monitoring the model – the model requires periodic re-evaluation on new data to assess whether it is still appropriate, which might involve retraining


Data Quality

What kinds of data quality problems can occur?
How can we detect problems with the data?
What can we do about these problems?

Examples of data quality problems:
Noise and outliers
Missing values
Duplicate data

Noise

Noise refers to the modification of original values
Examples: distortion of a person’s voice over a poor phone connection, “snow” on a television screen

[Figure: two sine waves, shown clean and with added noise]

Data Preprocessing
Aggregation
Missing values
Data errors
Outliers
Discretization
Data Normalization
Data Balancing
Feature selection

Aggregation
Combining two or more attributes (or objects) into a single attribute (or object)

Purpose
Data reduction
Reduce the number of attributes or objects
Change of scale
Cities aggregated into regions, states, countries, etc.
More “stable” data
Aggregated data tends to have less variability

Aggregation

[Figure: Variation of Precipitation in Australia – standard deviation of average monthly precipitation vs. standard deviation of average yearly precipitation]
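
The “more stable” claim can be sketched in pandas with synthetic monthly rainfall (hypothetical numbers, not the Australian series from the figure): yearly totals vary less, relative to their mean, than the raw monthly values.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic monthly precipitation (mm) for 20 locations over 10 years.
df = pd.DataFrame({
    "location": np.repeat(np.arange(20), 120),
    "year": np.tile(np.repeat(np.arange(2000, 2010), 12), 20),
    "precip": rng.gamma(shape=2.0, scale=30.0, size=20 * 120),
})

# Change of scale: aggregate months into yearly totals per location.
yearly = df.groupby(["location", "year"], as_index=False)["precip"].sum()

# Aggregated data tends to vary less relative to its mean.
print("monthly CV:", df["precip"].std() / df["precip"].mean())
print("yearly  CV:", yearly["precip"].std() / yearly["precip"].mean())
```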

Missing Values
Reasons for missing values
Information is not collected
(e.g., people decline to give their age and weight)
Attributes may not be applicable to all cases
(e.g., annual income is not applicable to children)

Handling missing values
Eliminate Data Objects
Estimate Missing Values
Ignore the Missing Value During Analysis
Replace with all possible values (weighted by their probabilities)

How to handle missing values?

id   mpg    cylinders   cubic inches   hp
1    14     8           350            165
2    31.9   4           89             71
3    51.7   8           302            140
4    15     ?           400            150
5    30.5   ?           ?              ?
6    23     ?           350            125
7    13     ?           351            158
8    14     8           ?              215
9    25.4   5           ?              77
10   37.7   4           89             62

How to handle missing values?
Replace missing values with mean values

id   mpg    cylinders   cubic inches   hp
1    14     8           350            165
2    31.9   4           89             71
3    51.7   8           302            140
4    15     6           400            150
5    30.5   6           275.86         129.22
6    23     6           350            125
7    13     6           351            158
8    14     8           275.86         215
9    25.4   5           275.86         77
10   37.7   4           89             62
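
A minimal pandas sketch of the mean-imputation step on the table above (the slide appears to additionally round the imputed cylinder count to a whole number; this sketch keeps two decimals throughout):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "id":           [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "mpg":          [14, 31.9, 51.7, 15, 30.5, 23, 13, 14, 25.4, 37.7],
    "cylinders":    [8, 4, 8, np.nan, np.nan, np.nan, np.nan, 8, 5, 4],
    "cubic_inches": [350, 89, 302, 400, np.nan, 350, 351, np.nan, np.nan, 89],
    "hp":           [165, 71, 140, 150, np.nan, 125, 158, 215, 77, 62],
})

# Replace each missing value with the mean of the observed values
# in the same column (e.g., cubic inches: 1931 / 7 = 275.86).
cols = ["cylinders", "cubic_inches", "hp"]
df[cols] = df[cols].fillna(df[cols].mean().round(2))
print(df)
```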


Duplicate Data
A data set may include data objects that are duplicates, or near duplicates, of one another
This is a major issue when merging data from heterogeneous sources

Examples:
The same person appearing with multiple email addresses

Data cleaning
The process of dealing with duplicate-data issues

Data Errors
What are the problems with the raw data set below?

Customer ID   Zip Code   Gender   Salary       Age   Marital Status   Trans. Amount
1001          10048      M        75,000       C     M                5,000
1002          J2S7K7     F        -40,000      40    W                3,000
1003          90210               10,000,000   45    S                6,000
1004          6269       M        50,000       0     S                5,000
1005          55101      F        99,999       30    D                10,000
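
A few illustrative sanity checks in pandas on a frame mirroring that table. The thresholds and the zip-code pattern are assumptions for the example, not rules; a sentinel value like 99,999 can only be caught with domain knowledge.

```python
import numpy as np
import pandas as pd

# Hypothetical frame mirroring the table above.
df = pd.DataFrame({
    "customer_id": [1001, 1002, 1003, 1004, 1005],
    "zip_code": ["10048", "J2S7K7", "90210", "6269", "55101"],
    "gender": ["M", "F", np.nan, "M", "F"],
    "salary": [75_000, -40_000, 10_000_000, 50_000, 99_999],
    "age": ["C", "40", "45", "0", "30"],
})

# Non-numeric ages become NaN; then flag impossible or suspicious values.
age = pd.to_numeric(df["age"], errors="coerce")
flags = pd.DataFrame({
    "bad_age": age.isna() | (age <= 0),
    "bad_salary": (df["salary"] < 0) | (df["salary"] >= 1_000_000),
    "missing_gender": df["gender"].isna(),
    "bad_zip": ~df["zip_code"].str.fullmatch(r"\d{5}"),
})
print(df[flags.any(axis=1)])  # rows 1001-1004 are each flagged at least once
```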

Outliers
Extreme values that lie close to the limits of the data range or do not follow the trend of the remaining data
May represent errors that occurred during data acquisition
May have a negative impact on the data mining method
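
One standard way to flag such values is Tukey's IQR rule (a common choice, not prescribed by the slide); a minimal sketch with hypothetical numbers:

```python
import pandas as pd

def iqr_outliers(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR] (Tukey's rule)."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

values = pd.Series([12, 14, 13, 15, 14, 13, 98])
print(values[iqr_outliers(values)])  # flags 98, far outside the fences
```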

Data Normalization

Problem:
The ranges of certain variables may differ greatly from one another, which can have a negative effect on the data mining technique: variables with greater ranges have a stronger impact on the results than others.

id   mpg    cylinders   cubic inches   hp
1    14     8           350            165
2    31.9   4           89             71
3    51.7   8           302            140
4    15     8           400            150
5    30.5   4           144            116.55
6    23     4           350            125


Data Normalization

Solution:
Normalize the data to standardize the scale of each variable.
Example: Min-Max Normalization

X* = (X - min(X)) / (max(X) - min(X))

id   mpg*   cylinders*   cubic inches*   hp*
1    0      1            0.84            1
2    0.47   0            0               0
3    1      1            0.68            0.73
4    0.03   1            1               0.84
5    0.44   0            0.18            0.48
6    0.24   0            0.84            0.57

Other techniques:
Decimal Scaling
Z-Score Normalization
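
The same transformation is a one-liner in pandas; this sketch reproduces the normalized table above:

```python
import pandas as pd

# The six records from the slide above.
df = pd.DataFrame({
    "mpg":          [14, 31.9, 51.7, 15, 30.5, 23],
    "cylinders":    [8, 4, 8, 8, 4, 4],
    "cubic_inches": [350, 89, 302, 400, 144, 350],
    "hp":           [165, 71, 140, 150, 116.55, 125],
})

# Min-max normalization: X* = (X - min(X)) / (max(X) - min(X)),
# mapping every column onto the [0, 1] range.
normalized = (df - df.min()) / (df.max() - df.min())
print(normalized.round(2))  # matches the mpg*, cylinders*, ... table

# Z-score alternative: (df - df.mean()) / df.std()
```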


Data Balancing

Real-world data is often imbalanced
In a credit card transaction stream, for example, 99% of transactions are genuine while only 1% are fraudulent
In such cases a machine learning algorithm will have difficulty learning patterns that correlate with the “fraud” class
Performance can be improved either by scaling down the majority (“genuine”) class or by creating new data for the minority (“fraud”) class
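
A minimal pandas sketch of both options on a hypothetical 99:1 transaction stream. More sophisticated over-sampling methods (e.g., SMOTE) synthesize new minority examples rather than duplicating existing ones.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical transaction stream: roughly 99% genuine, 1% fraud.
df = pd.DataFrame({
    "amount": rng.exponential(50.0, size=10_000),
    "label": np.where(rng.random(10_000) < 0.01, "fraud", "genuine"),
})

genuine = df[df["label"] == "genuine"]
fraud = df[df["label"] == "fraud"]

# Down-sample the majority class to the minority's size ...
down = pd.concat([genuine.sample(len(fraud), random_state=0), fraud])

# ... or up-sample the minority class by sampling with replacement.
up = pd.concat([genuine, fraud.sample(len(genuine), replace=True, random_state=0)])

print(down["label"].value_counts(), up["label"].value_counts(), sep="\n")
```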

Feature Selection

Motivation: removal of redundant attributes has the following advantages:
It results in a smaller, more understandable model
It reduces the time spent in model building – increases efficiency
It can improve classification accuracy

General Approaches

In general there are two different approaches:
Forward Selection
Backward Elimination

[Figure: lattice of attribute subsets for the weather data – outlook (O), temp (T), humidity (H), windy (W): the single attributes, the pairs (O,T … H,W), the triples (O,T,H … T,H,W), and the full set O,T,H,W]

Forward Selection

In forward selection we start with an empty set A of attributes
At each step, every attribute not yet in A is tried as part of A and a performance measure is evaluated (for example, classification accuracy)
The attribute that produces the best performance is added to set A
The process is repeated until no more attributes can be added to A – i.e. at a particular iteration every remaining attribute, when added, decreases rather than increases the accuracy
The set A at the end of the process contains the set of non-redundant attributes
Note that an attribute discarded in one iteration is put back into contention in later iterations. Why?
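
A greedy sketch of this loop, using scikit-learn with the iris data as a stand-in (the weather data from the figure is not bundled with sklearn); attribute subsets are scored by cross-validated accuracy. Backward elimination, below, is the mirror image: start from the full set and drop the attribute whose removal hurts least. scikit-learn also ships SequentialFeatureSelector, which implements both directions.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

selected = []      # the growing attribute set A
best_score = 0.0

# Greedy forward selection: at each round, try every attribute not yet
# in A and keep the one that most improves cross-validated accuracy.
while len(selected) < X.shape[1]:
    scores = {
        f: cross_val_score(DecisionTreeClassifier(random_state=0),
                           X[:, selected + [f]], y, cv=5).mean()
        for f in range(X.shape[1]) if f not in selected
    }
    f_best = max(scores, key=scores.get)
    if scores[f_best] <= best_score:  # no attribute improves accuracy: stop
        break
    selected.append(f_best)
    best_score = scores[f_best]

print("selected attributes:", selected, "cv accuracy:", round(best_score, 3))
```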

Backward Elimination

Similar to forward selection, but set A initially consists of the full set of attributes
At each step we eliminate (rather than add) the attribute whose removal leads to the least change in accuracy (or the smallest loss of Information Gain)
The process is repeated until we reach an iteration in which removing any remaining attribute would cause an unacceptable loss of accuracy or Information Gain
The attributes that remain in set A form the list of non-redundant attributes

Data Mining Tasks


Classification Learning

Classification learning is supervised
The scheme is provided with the actual outcome
The outcome is called the class of the example
Success can be measured on fresh data for which the class labels are known (test data)
Example problems: weather data, contact lenses, irises, labor negotiations

Evaluating Success in Classification

The confusion matrix is a good summary of a classifier’s performance

                 Computed Accept   Computed Reject
Actual Accept    True Accept       False Reject
Actual Reject    False Accept      True Reject

Evaluating Success in Classification

The matrix provides us with a good basis for comparing classifiers
For example, Models A and B represent two different classifiers for the credit card application problem
Which model is better, A or B? (Compare the stakes in fraud detection vs. cancer detection.)

Model A          Computed Accept   Computed Reject
Actual Accept    600               25
Actual Reject    75                300

Model B          Computed Accept   Computed Reject
Actual Accept    600               75
Actual Reject    25                300
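
Both models classify 900 of 1,000 cases correctly, so accuracy alone cannot separate them; the difference lies in which errors they make. A small sketch computing per-class measures from the matrices above (the precision/recall framing is a standard addition, not from the slide):

```python
def summarize(tp, fn, fp, tn):
    """Per-class measures from a 2x2 confusion matrix (Accept = positive)."""
    accuracy = (tp + tn) / (tp + fn + fp + tn)
    precision = tp / (tp + fp)  # of the computed Accepts, how many were right
    recall = tp / (tp + fn)     # of the actual Accepts, how many were found
    return accuracy, precision, recall

# (True Accept, False Reject, False Accept, True Reject) for each model.
for name, cm in {"A": (600, 25, 75, 300), "B": (600, 75, 25, 300)}.items():
    acc, prec, rec = summarize(*cm)
    print(f"Model {name}: accuracy={acc:.2f}, precision={prec:.2f}, recall={rec:.2f}")
```

Model A rarely misses a good applicant (higher recall), while Model B rarely accepts a bad one (higher precision); which is “better” depends on the cost of each error type – exactly the fraud-detection vs. cancer-detection contrast above.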

Association Rule Discovery: Definition

Given a set of records, each of which contains some number of items from a given collection,
produce dependency rules that will predict the occurrence of an item based on the occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules Discovered:
{Milk} -> {Coke}
{Diaper, Milk} -> {Beer}
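
Such rules are usually scored by support and confidence (standard measures, not named on the slide); a minimal sketch over the five transactions above:

```python
# The five transactions from the table above.
baskets = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]

def support(itemset):
    """Fraction of transactions containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Estimated P(consequent | antecedent)."""
    return support(antecedent | consequent) / support(antecedent)

# {Diaper, Milk} -> {Beer}: support 2/5 = 0.4, confidence 2/3
print(support({"Diaper", "Milk", "Beer"}), confidence({"Diaper", "Milk"}, {"Beer"}))
# {Milk} -> {Coke}: confidence 3/4
print(confidence({"Milk"}, {"Coke"}))
```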

Association Rule Discovery

Marketing and Sales Promotion:
Let the rule discovered be {Bagels, …} -> {Potato Chips}
Potato Chips as consequent => can be used to determine what should be done to boost their sales
Bagels in the antecedent => can be used to see which products would be affected if the store discontinued selling bagels
Bagels in the antecedent and Potato Chips in the consequent => can be used to see which products should be sold with Bagels to promote the sale of Potato Chips!


Clustering

Finding groups of items that are similar
Clustering is unsupervised
The class of an example is not known
Success of clustering is often measured subjectively
Example problem: the iris data without the class attribute

Numeric Prediction

Like classification learning, but with a numeric “class”
Learning is supervised
The scheme is provided with a target value
Success is measured on test data
Example: a modified version of the weather data

Example: Linear Regression

Models the relationship between a variable Y and a variable X using a linear fit: Y = aX + b
Determines the best straight line through the given data
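
A minimal least-squares sketch with NumPy on hypothetical points (np.polyfit fits exactly this Y = aX + b form by minimizing the squared vertical distances to the line):

```python
import numpy as np

# Hypothetical points roughly following y = 2x + 1 with noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 1, size=x.size)

# Least-squares fit of Y = a*X + b; coefficients come back
# highest degree first.
a, b = np.polyfit(x, y, deg=1)
print(f"Y = {a:.2f} X + {b:.2f}")
```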


