COMP3430 / COMP8430 Data wrangling
Lecture 6: Resolving data quality issues and data cleaning
(Lecturer: )

Lecture outline
● Data quality issues
● Forms of data pre-processing
● An overview of data cleaning
– Impute missing data
– Smooth noisy data
– Remove duplicate data
– Resolve inconsistent data
● Summary

Data quality issues
Various causes of data errors:
– Data entry errors / subjective judgment
– Limited (computing) resources
– Security / accessibility trade-off
– Complex data, adaptive data
– Volume of data
– Redundant data
– Multiple sources / distributed heterogeneous systems

Forms of data pre-processing
● Data cleaning: dirty data -> clean data
● Data integration
● Data transformation: for example, -1, 27, 100, 57, 63 -> -0.01, 0.27, 1.0, 0.57, 0.63
● Data reduction: for example, attributes A1 ... A126 and records R1 ... R8000 -> attributes A1 ... A100 and records R1 ... R1000
(Figure: Han and Kamber, DM Book, 2nd Ed. (Copyright © 2006 Elsevier Inc.))
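
The data transformation example is consistent with dividing every value by 100, that is, scaling by a power of ten so that the largest absolute value is at most 1. The slide does not name the technique, so the following is only a sketch under that assumption; the function name `decimal_scale` is hypothetical.

```python
def decimal_scale(values):
    """Divide by the smallest power of ten that brings every |value| to at most 1."""
    largest = max(abs(v) for v in values)
    factor = 1
    while largest / factor > 1:
        factor *= 10
    return [v / factor for v in values]

# Values from the data transformation example on the slide.
scaled = decimal_scale([-1, 27, 100, 57, 63])
print(scaled)
```

Other transformations, such as min-max scaling, would give different results for the same input.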

Data cleaning: An overview
A highly crucial data pre-processing step
Includes various tasks:
– Dealing with missing data
– Handling outliers and noisy data
– Removing redundant and duplicate data
– Resolving inconsistencies
(Figure: the data cleaning cycle, with the labels Extract data, Impute missing data, Smooth noisy data, Remove duplicate data, Resolve inconsistency)

Missing data
One of the most common data quality issues is missing data
Absence of attribute values due to various reasons
– Equipment malfunction
– Not entered due to misunderstanding
– Not considered important during data entry
– Deleted due to inconsistency with other values

Impute missing data
● Manual imputation
– Time consuming and infeasible
● Automatic imputation
– Global constant (for example, N/A)
– Mean attribute value
– Mean value of all records belonging to the same class
– Inference-based (for example, Bayesian or decision tree): use data mining and machine learning to predict most likely values to impute

Automatic imputation
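Original data (? marks a missing value) and the Weight column after each imputation method:

      Gender  Weight  Global  Mean  Group mean
R1    M       65      65      65    65
R2    M       72      72      72    72
R3    F       54      54      54    54
R4    F       51      51      51    51
R5    M       ?       N/A     64.8  73
R6    F       ?       N/A     64.8  52.5
R7    M       82      82      82    82

Inference-based (decision tree on Gender, then Age; leaves give the imputed weight):
– Male, Age <=30 -> 60
– Male, Age >30 -> 75
– Female, Age <=30 -> 50
– Female, Age >30 -> 62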
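
The first three automatic imputation methods can be sketched in a few lines of plain Python. This is a minimal illustration using the seven records from the example; the function names are hypothetical, and `None` is assumed to mark a missing weight.

```python
from statistics import mean

# Records from the example: (id, gender, weight); None marks a missing weight.
records = [
    ("R1", "M", 65), ("R2", "M", 72), ("R3", "F", 54), ("R4", "F", 51),
    ("R5", "M", None), ("R6", "F", None), ("R7", "M", 82),
]

def impute_constant(rows, constant="N/A"):
    """Replace every missing weight with a global constant."""
    return [w if w is not None else constant for _, _, w in rows]

def impute_mean(rows):
    """Replace every missing weight with the mean of all known weights."""
    overall = mean(w for _, _, w in rows if w is not None)
    return [w if w is not None else overall for _, _, w in rows]

def impute_group_mean(rows):
    """Replace a missing weight with the mean weight of the same gender."""
    groups = {}
    for _, gender, w in rows:
        if w is not None:
            groups.setdefault(gender, []).append(w)
    group_means = {g: mean(ws) for g, ws in groups.items()}
    return [w if w is not None else group_means[g] for _, g, w in rows]

print(impute_constant(records))
print(impute_mean(records))
print(impute_group_mean(records))
```

The group mean uses only records of the same class (here, gender), which is why R5 and R6 receive different values.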

Outliers and noisy data
Random error or variance in the data
Incorrect values and errors occur due to
– Faulty data collection instruments
– Data entry problems
– Data transmission problems
– Technology limitation
– Misunderstanding of required data
Depending on the application, outliers can be important, for example in fraud detection or national security

Smooth noisy data
● Binning
– Sort data and partition into equal-frequency bins
– Smooth by bin means, bin median, bin boundaries
● Regression
– Smooth by fitting data to regression functions
● Clustering
– Identify outliers not belonging to clusters
● Manual inspection (active learning) of possible outliers

Binning (1)
● Equal-width / distance
– Divide the range into N intervals of equal size
– Width of intervals = (max-min)/N
– Skewed data is not handled well
● Equal-depth / frequency
– Divide the range into N intervals of (approximately) same number of samples
– Suitable for skewed data distributions
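
As a rough sketch of equal-width binning, the function below assigns each value to one of N intervals of width (max-min)/N. The function name `equal_width_bins` is hypothetical, and the sample values are the fifteen values used in the lecture's equal-frequency example; placing the maximum in the last bin is an assumption about how the upper boundary is handled.

```python
def equal_width_bins(values, n_bins):
    """Assign each value a bin index 0..n_bins-1 using intervals of equal width."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

values = [5, 27, 100, 59, 28, 48, 50, 39, 9, 7, 20, 63, 10, 41, 9]
bins = equal_width_bins(values, 3)
counts = [bins.count(i) for i in range(3)]
print(counts)
```

With three bins the counts come out very uneven, because the single large value 100 stretches the range. This illustrates why equal-width binning does not handle skewed data well.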

Binning (2)
Values: 5, 27, 100, 59, 28, 48, 50, 39, 9, 7, 20, 63, 10, 41, 9

                          Bin 1              Bin 2                Bin 3
Bins equal-frequency      5, 7, 9, 9, 10     20, 27, 28, 39, 41   48, 50, 59, 63, 100
Smooth by bin means       8, 8, 8, 8, 8      31, 31, 31, 31, 31   64, 64, 64, 64, 64
Smooth by bin medians     9, 9, 9, 9, 9      28, 28, 28, 28, 28   59, 59, 59, 59, 59
Smooth by bin boundaries  5, 5, 10, 10, 10   20, 20, 20, 41, 41   48, 48, 48, 48, 100
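
The table can be reproduced with a short sketch of equal-frequency binning followed by the three smoothing variants. The function names are hypothetical, and the tie-breaking rule for bin boundaries (ties go to the lower boundary) is an assumption, since the slide does not specify one.

```python
from statistics import mean, median

def equal_frequency_bins(values, n_bins):
    """Sort the values and split them into n_bins bins of (nearly) equal size."""
    ordered = sorted(values)
    size, extra = divmod(len(ordered), n_bins)
    bins, start = [], 0
    for i in range(n_bins):
        end = start + size + (1 if i < extra else 0)
        bins.append(ordered[start:end])
        start = end
    return bins

def smooth_by_means(bins):
    """Replace every value by the mean of its bin."""
    return [[mean(b)] * len(b) for b in bins]

def smooth_by_medians(bins):
    """Replace every value by the median of its bin."""
    return [[median(b)] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value by the closer of its bin's minimum and maximum."""
    # Assumption: a value exactly halfway goes to the lower boundary.
    return [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

values = [5, 27, 100, 59, 28, 48, 50, 39, 9, 7, 20, 63, 10, 41, 9]
bins = equal_frequency_bins(values, 3)
print(bins)
print(smooth_by_means(bins))
print(smooth_by_medians(bins))
print(smooth_by_boundaries(bins))
```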

Regression
(Figure: scatter plot with the regression line y=x+2; labels: Data, Outliers)
To be covered in more detail in the data mining course
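
One possible way to use regression for spotting noisy values is to fit a line by ordinary least squares and flag points whose residual is unusually large. The sketch below is an illustration only, not the method shown in the lecture: the threshold of two residual standard deviations, the function names, and the sample data (points on y=x+2 with one altered value) are all assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def flag_outliers(xs, ys, k=2.0):
    """Return indices whose residual exceeds k times the residual standard deviation."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = (sum(r * r for r in residuals) / len(residuals)) ** 0.5
    return [i for i, r in enumerate(residuals) if abs(r) > k * spread]

xs = list(range(10))
clean_ys = [x + 2 for x in xs]   # points exactly on the line y = x + 2
noisy_ys = list(clean_ys)
noisy_ys[5] = 20                 # one hypothetical noisy measurement

print(fit_line(xs, clean_ys))
print(flag_outliers(xs, noisy_ys))
```

Note that the outlier itself pulls the fitted line away from y=x+2, which is one reason more robust methods exist.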

Clustering
(Figure: clustered data points; labels: Data, Outliers)
To be covered in more detail in the data mining course

Redundant data
Duplicate records occur within a single data source, or when combining multiple sources
– The same entity/object might have different values in an attribute
– One attribute may be a derived attribute in another database
– Attribute values of the same object may have been entered at different times
● Redundant attributes can be identified by correlation analysis
● Redundant records can be identified by deduplication or data integration (more about this later in the course)

Identifying redundant attributes (1)
              Cancer               No cancer            Sum (row)
              Observed  Expected   Observed  Expected
Smoking       250       90         200       360        450
Not smoking   50        210        1000      840        1050
Sum (column)  300                  1200                 1500

The observed counts differ strongly from the counts expected if smoking and cancer were independent. Therefore, smoking and cancer are highly correlated
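
The expected counts in the table follow from (row sum × column sum) / total, for example 450 × 300 / 1500 = 90. The extracted text does not show which statistic supports the conclusion; assuming it is Pearson's chi-square test, a minimal sketch of the calculation looks like this (function names are hypothetical):

```python
observed = [[250, 200],    # smoking:     cancer, no cancer
            [50, 1000]]    # not smoking: cancer, no cancer

def expected_counts(table):
    """Expected count per cell under independence: row sum * column sum / total."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    total = sum(row_sums)
    return [[r * c / total for c in col_sums] for r in row_sums]

def chi_square(table):
    """Pearson chi-square statistic: sum of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))

expected = expected_counts(observed)   # [[90.0, 360.0], [210.0, 840.0]]
statistic = chi_square(observed)       # about 507.9
print(expected, round(statistic, 2))
```

A 2 × 2 table has one degree of freedom, for which the chi-square critical value at the 0.001 significance level is about 10.83. A statistic of roughly 508 is far above that, so independence is rejected.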

Inconsistent data
Different formats, codes, and standards across different sources (even within a single source)
Resolving using external reference data
● Lookup tables
– E.g. Sydney, NSW, 7000 -> Sydney, NSW, 2000
● Rules
– Male or 0 -> M
– Female or 1 -> F
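
Both approaches can be sketched with dictionaries. The reference table below contains only the single Sydney entry from the slide and is a hypothetical stand-in for real external reference data; the function names, and the extra "m" / "f" entries that let already clean codes pass through, are assumptions as well.

```python
# Hypothetical reference table: (suburb, state) -> correct postcode.
POSTCODES = {("Sydney", "NSW"): "2000"}

# Rules from the slide, plus "m"/"f" so that clean codes are kept.
GENDER_RULES = {"male": "M", "0": "M", "m": "M",
                "female": "F", "1": "F", "f": "F"}

def fix_postcode(suburb, state, postcode):
    """Replace the postcode with the reference value when one is known."""
    return suburb, state, POSTCODES.get((suburb, state), postcode)

def normalise_gender(value):
    """Apply the rules; unknown codes are returned unchanged."""
    return GENDER_RULES.get(str(value).strip().lower(), value)

print(fix_postcode("Sydney", "NSW", "7000"))
print([normalise_gender(v) for v in ["Male", 0, "Female", 1, "M"]])
```

Returning unknown values unchanged keeps unexpected codes visible for manual inspection instead of silently guessing.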

Summary
● Data cleaning is a crucial data pre-processing step
● The data cleaning cycle includes several tasks:
– Handling missing values, smoothing noisy data, removing redundant values, and resolving inconsistencies
● Directions of future developments in data cleaning:
– Efficient data cleaning tools for Big data, automated data cleaning, interactive data cleaning, and cleaning of real-time and dynamic data