程序代写代做代考 data mining python algorithm Hive finance database flex Data Cleansing

Data Cleansing — 1

Faculty of Information Technology, Monash University, Australia

FIT5196 week 6

(Monash) FIT5196 1 / 24

Data Wrangling Process

(Monash) FIT5196 2 / 24

Outline

1 Data Anomalies

2 Exploratory Data Analysis

3 Summary

(Monash) FIT5196 3 / 24

Data Anomalies

Data Cleansing

Data Cleansing: A process of detecting and removing errors and
inconsistencies from data in order to improve the quality of data.

(Monash) FIT5196 4 / 24

Data Anomalies

Data Cleansing

Data Cleansing: A process of detecting and removing errors and
inconsistencies from data in order to improve the quality of data.

Data Cleansing: An iterative process
Data Glitches: Monsters in Your Data 165

Define
&

Identify

Detect
&

Quantify

Clean
&

Rectify

Measure
&

Verify

-Types of Glitches
-Identify those that impact the User’s
application

-Methods and algorithms for detection
-Quantify prevalence

-Develop suitable repair strategies
-Repair the data

-Measure improvement in quality
-Verify data meet tolerance, cost
and bias criteria

Iterate

Fig. 1 Overview of data quality process: the process is flexible to allow the user to customize data
quality assessment and repair to suit resource and accuracy requirements

between cleaning and distortion, within the user’s resource limitations and quality
specifications. In this chapter, we provide a brief introduction to these concepts and
explain data quality assessment within the framework of statistical theory.

The rest of the chapter is organized as follows. Section 2 provides an overview
of the statistical data quality process, emphasizing its iterative nature. In Sect. 3,
we discuss new types of glitches and complex glitch patterns that can be leveraged
to develop efficient context-dependent cleaning strategies. In Sect. 4, we describe
methods for detecting different types of glitches, while Sect. 5 focuses on assessing
the quality of data. Section 6 offers suggestions for cleaning and repairing data.
In Sect. 7, we discuss how to choose a strategy from a multitude of potential
strategies, based on the notion of the trade-off between cleaning and distortion. A
brief literature overview is provided in Sect. 8. Finally, we present our conclusions
in Sect. 9.

2 Data Cleaning, an Iterative Process

Data cleaning is an iterative process. Typically, a user is presented with a data set
D and would like to clean it to meet cleanliness specifications (“no more than
10 % missing values”) subject to resource and cost considerations. The process
of cleaning could potentially introduce new glitches and make the data dirtier.
Therefore, after each cleaning step, the user remeasures the quality and repeats the
process until the specifications are satisfied.

A schematic version of the data quality process is shown in Fig. 1. It has four
broad stages. We give a brief overview below and describe each stage in detail in
the following sections.

An overview of data quality process, from “Data Glitches: Monsters in Your Data” by Tamraparni Dasu, in “Handbook of Data Quality” 2013

(Monash) FIT5196 4 / 24

Data Anomalies

Data Cleansing

Data Cleansing: A process of detecting and removing errors and
inconsistencies from data in order to improve the quality of data.

Data Cleansing: An iterative process

(Monash) FIT5196 4 / 24

Data Anomalies

Data Anomalies (i.e., Glitches or Errors)

Data Anomalies describes the distortion of the data because of any of the
problems that might encounter in the life cycle of data that includes its
capture, storage, update, transmission, access, archive, restore, deletion and
purge.
Some common data quality problems
É Missing data
É Inconsistent and faulty data
É Outliers
É Duplicates

(Monash) FIT5196 5 / 24

Data Anomalies

Data Anomaly Classification: source-based1

1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do

(Monash) FIT5196 6 / 24

Data Anomalies

Data Anomaly Classification: source-based1

1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do

(Monash) FIT5196 6 / 24

Data Anomalies

Data Anomaly Classification: source-based1

1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do

(Monash) FIT5196 6 / 24

Data Anomalies

Data Anomaly Classification: source-based1

1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do

(Monash) FIT5196 6 / 24

Data Anomalies

Data Anomalies Classification: type-based

Syntactical Anomalies: format and values

Semantic Anomalies: comprehensiveness and non-redundancy

Coverage Anomalies: missing values

(Monash) FIT5196 7 / 24

Data Anomalies

Data Anomalies Classification: type-based

Syntactical Anomalies: format and values
É Lexical errors: data format discrepancies in terms of database; spelling
errors, typos in terms of linguistics.

É Domain format errors: inconsistent value format of an attribute, e.g.,
Buntine, Wray Lindsay v.s. Wray L. Buntine

É Irregularities: the non-uniform use of values, units and abbreviations?e.g.,
salary in difference currencies.

Semantic Anomalies: comprehensiveness and non-redundancy

Coverage Anomalies: missing values

(Monash) FIT5196 7 / 24

Data Anomalies

Data Anomalies Classification: type-based

Syntactical Anomalies: format and values
Semantic Anomalies: comprehensiveness and non-redundancy
É Integrity constraint violations
É Contradictions: violation of dependencies between attributes, e.g., AGE and
DOB.

É Duplicates: observations representing the same entity.
É Invalid observations

Coverage Anomalies: missing values

(Monash) FIT5196 7 / 24

Data Anomalies

Data Anomalies Classification: type-based

Syntactical Anomalies: format and values

Semantic Anomalies: comprehensiveness and non-redundancy
Coverage Anomalies: missing values
É Missing values: due to omissions while collecting the data
É Missing observations:

(Monash) FIT5196 7 / 24

Data Anomalies

Taxonomy of Dirty Data2

Dirty data manifests itself in three different ways:

missing data

not missing but wrong data,

not missing and not wrong but unusable,

2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24

Data Anomalies

Taxonomy of Dirty Data2

Dirty data manifests itself in three different ways:
missing data
É Missing data where there is no Null-not-allowed constraint
É Missing data where Null-not-allowed constraint should be enforced

not missing but wrong data,

not missing and not wrong but unusable,

2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24

Data Anomalies

Taxonomy of Dirty Data2

Dirty data manifests itself in three different ways:

missing data
not missing but wrong data, due to
É Integrity constraints

− violation of data type constraint, including value range
− violation of non-null uniqueness constraint, i.e., duplicated data
− violation of referential integrity
− Wrong categorical data
− Outdated temporal data
− Inconsistent spatial data

É Data Entry error involving a single table
− Data entry error involving a single field: erroneous entry, misspelling,

extraneous data
− Data entry error involving multiple fields: entry into wrong fields, wrong

derived-field data

not missing and not wrong but unusable,

2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24

Data Anomalies

Taxonomy of Dirty Data2

Dirty data manifests itself in three different ways:

missing data

not missing but wrong data,
not missing and not wrong but unusable, due to
É Different data for the same entity across multiple databases
É Ambiguous data due to: the use of abbreviation (Dr. for doctor or drive),
É Incomplete context (e.g., Sydney of Australia or Canada)
É The use of abbreviation (e.g., ste for suite, rd for road, st for street, etc)
É Alias/nick name (e.g., Bill Clinton, President Clinton)
É Encoding formats (e.g, ASCII, …)
É Representations (e.g., negative number, precision, fraction)
É Measurement units (e.g., data, time, currency, weight, area, etc.)
É Uses of special characters (e.g., space, dash, parenthesis in phone numbers)
in concatenated data

2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24

Data Anomalies

Quality measures

Adapted from “Problems, Methods, and Challenges in Comprehensive Data Cleansing” By Muller and Fretag

(Monash) FIT5196 9 / 24

Data Anomalies

Data Anomalies: Looking for Errors

ID Landgrabbed ISO Landgrabber Base Sector Hectares Production Projected investment
Status of
deal Start End

A23
Algeria DZA Al Qudra UAE Finance 31000.00 Milk, olive oil,

potatoes
Done

06/2005 01/2015

A45
Algeria Al Qudra UAE real estate 31000.00 Milk, olive oil,

potatoes
Done

06/1905 01/2012

Angola CAMC Engineering
Co. Ltd

China Construction 1500.00 Rice US$77 million Done

06/2010 05/2005

A1
Philippines Kuwait Kuwait Government 20000 Maize, rice In process

10/2015 12/1917

A34
菲律宾 Zuellig Group Malaysia Agribusiness,

health care
30000 Maize In process

06/2016 08/2020

A45
Philippines Oman Oman Government 10,000 Rice 150m Processing

06/1909 09/1917

A34

Philippines PHL Brunei Investment
Authority

Brunei Gover 10,000 Rice Proposed

03/2016
A56 Philippines China China 100,280,000 Various Suspended 02/2000 11/2001

A54
Philippines Green Future

Innovation
Japan 11,000 Sugar cane US$120 million Done

06/2014 09/2015

A4
argantin ARG Beidahuang CH 320000 Maize, soybeans, wheat US$1,500

million
Suspended

12/1900 07/1901

A65
Tanzânia

Nirmal Seeds
Indië

Agribusiness 30000 Seeds In process
03/2013 06/2016

tanzania Yes Bank India Finance 50000 Rice, wheat In process
06/2010 06/2017

A3
Tansania Export Trading

Group
Singapore Agribusiness 8000 Rice Done

12/2015 10/2018 Syntactical Typos Irreglarities

A23
Brazil BRA Clean Energy

Brazil
UK 30,000 Sugar cane Done

03/2012 09/2013 Coverage

N67

BRA Adecoagro US Agribusiness 165,000 Cattle, cofee, grains,
soybeans, sugar cane

98,000,000 Done

10/2010 07/2005

A67
brazil BRA Archer Daniels

Midland
US Agribusiness 12,000 Oil palm In process

06/2014 01/2015

A56

Brasil Black River Asset
Management

United
states

Finance 50,000 Crops 20000000 Done

02/2010 2015

(Monash) FIT5196 10 / 24

Data Anomalies

Data Anomalies: Some Errors

ID Landgrabbed ISO Landgrabber Base Sector Hectares Production Projected investment Status of deal Start End

A23 Algeria DZA Al Qudra UAE Finance 31000.00
Milk, olive oil,

potatoes Done 06/2005 01/2015

A45 Algeria Al Qudra UAE real estate 31000.00
Milk, olive oil,

potatoes Done 06/1905 01/2012

A3 Angola
CAMC Engineering

Co. Ltd China Construction 1500.00 Rice US$77 million Done 06/2010 05/2005

A1 Philippines Kuwait Kuwait Government 20000 Maize, rice In process 10/2015 12/1917

A34 菲律宾 Zuellig Group Malaysia
Agribusiness,
health care 30000 Maize In process 06/2016

A45 Philippines Oman Oman Government 10,000 Rice 150m Processing 06/1909 09/1917

A34 Philippines PHL
Brunei Investment

Authority Brunei Gover 10,000 Rice Proposed 03/2016

A56 Philippines China China 100,280,000 Various Suspended 02/2000 11/2001

A54 Philippines Green Future
Innovation

Japan 11,000 Sugar cane US$120 million Done 06/2014 09/2015

A4 argantin ARG Beidahuang CH 320000
Maize, soybeans,

wheat
US$1,500
million Suspended 12/1900 07/1901 What does it affect

A65 Tanzânia Nirmal Seeds Indië Agribusiness 30000 Seeds In process 03/2013 06/2016

tanzania Yes Bank India Finance 50000 Rice, wheat In process 06/2010

A3 Tansania Export Trading
Group

Singapore Agribusiness 8000 Rice Done 12/2015 10/2018 Systamatic Typos Irreglarities Mandatory fields
A23 Brazil BRA

Clean Energy
Brazil UK 30,000 Sugar cane Done 03/2012 09/2013 Semantic Consistency Integrity constraint validation redundanct outliers

N67 BRA Adecoagro US Agribusiness 165,000
Cattle, cofee,

grains, soybeans,
sugar cane

98,000,000 Done 10/2010 07/2005

Coverage

A67 brazil In process 06/2014

A56 Brasil
Black River Asset

Management
United
states Finance 50,000 Crops 20000000 Done 02/2010 2015

Some problems:
1- Consistency in country names Integrity Validity Cross validation & Typos
2- ID: Consistency and duplication completeness Covergae
3- Irrigularities in contry names consistency Schema confrormance Mandatory fields
4- Date: End before start Unifromity irregualrity
5- Date: End 1917 while processing density Coverage
6- Done in 2018
7- What kind of missing value in
8- Consistency in status of the deal
9- Kind of missing values on End data If it is in process, the end date is missing : MAR or MCAR or MNAR

This one is MNAR as it is missing mostly when the status is in process. The in process is an indictor in this case not a dependent variable.

10- Outlier in Hectars 100,2180,000
11- Consistency in units in projected inverstments
12- Msiing in A6 (second last). What is the best to do?
13- Base columns, abbriviation or full name?
14…. more

(Monash) FIT5196 11 / 24

Data Anomalies

Data Cleansing Blueprint

1 Data Auditing (or Analysis): detect errors and inconsistencies in the data
É Data profiling: focuses on the instance analysis of individual attributes
É Data mining: descriptive data mining, e.g., clustering, summarisation,
association discovery, etc.

2 Definition of transformation workflow: define a sequence of operations on
the data, used to detect and eliminate anomalies
É Early data cleaning steps: correct single-source instance problems
É Later data cleaning steps: deal with schema/data integration and clean
multi-source problems.

3 Verification: test and evaluate the correctness and effectiveness of a
transformation workflow

4 Data transformation: Execute the transformation steps
5 Post-processing and controlling: inspect the results to verify the correctness

of the specified operations.

(Monash) FIT5196 12 / 24

Exploratory Data Analysis

Outline

1 Data Anomalies

2 Exploratory Data Analysis

3 Summary

(Monash) FIT5196 13 / 24

Exploratory Data Analysis

Two types of variables:
É categorial variable
É numerical variable

Two types of EDA
É Non-graphical: summary statistics
É Graphical: various plots

EDA
É Univariate
É Multivariate

(Monash) FIT5196 14 / 24

Exploratory Data Analysis

Univariate non-graphical methods: Categorial data

Categorical variables: values or observations that can be sorted into groups
or categories.
É Examples: Sex, Eye colour and blood type.

The characteristics of interest for a categorial variable
É the range of values
É the frequency of occurrence for each value
É univariate non-graphical technique: tabulation of the frequencies

(Monash) FIT5196 15 / 24

Exploratory Data Analysis

Univariate non-graphical methods: Quantitative data

Numerical variables: values or observations that can be measured, and these
numerical values can be placed in ascending or descending order.
É Examples: salary, height, weight, etc.

The characteristics of the population distribution of a numerical variable
É center tendency: “location” of a distribution, dealing with typical or middle
values.

É spread: an indicator of how far away from the centre we are still likely to find
data values.

É shape: Skewness and Kurtosis
É outliers: values that are outside of the areas of a distribution that would
commonly occur.

(Monash) FIT5196 16 / 24

Exploratory Data Analysis

Univariate non-graphical methods: Quantitative data

The characteristics of the population distribution of a numerical variable
É center tendency

− Mean: the arithmetic average of a set of values

µ =
x1 + x2 + · · ·+ xn

n
− Median: the middle value after all the values are put in an order list.
− Mode: the most frequent occurring value in a set of values

(Monash) FIT5196 17 / 24

Exploratory Data Analysis

Univariate non-graphical methods: Quantitative data

The characteristics of the population distribution of a numerical variable
É center tendency

This figure is from “Summary Statistics”

(Monash) FIT5196 17 / 24

Exploratory Data Analysis

Univariate non-graphical methods: Quantitative data

The characteristics of the population distribution of a numerical variable
É spread

− Range: the difference between the smallest and largest values in the data set.
− Standard Deviation and Variance

σ2 =

∑n
i=1(xi − x̄)

n − 1

This figure is from Wikipedia
(Monash) FIT5196 17 / 24

Exploratory Data Analysis

Univariate non-graphical methods: Quantitative data

The characteristics of the population distribution of a numerical variable
É spread

− Range: the difference between the smallest and largest values in the data set.
− Standard Deviation and Variance

σ2 =

∑n
i=1(xi − x̄)

n − 1
− The Interquartile Range:

Q0 the minimum
Q1 bigger than 25% of the data points
Q2 the median
Q3 bigger than 75% of the data points
Q4 the maximum

− The inter-quartile range (IQR):

IQR = Q3−Q1

(Monash) FIT5196 17 / 24

Exploratory Data Analysis

Univariate non-graphical methods: Quantitative data

The characteristics of the population distribution of a numerical variable
É The output of Pandas describe() function

(Monash) FIT5196 17 / 24

Exploratory Data Analysis

Univariate graphical methods

Histograms: a quick way of learning the characteristics of your data,
including central tendency, spread, shape, outliers, etc.

Boxplots (or Box-and-Whiskers Plot): display five-point summaries and
potential outliers in graphical form

(Monash) FIT5196 18 / 24

Exploratory Data Analysis

Univariate graphical methods

Histograms: a quick way of learning the characteristics of your data,
including central tendency, spread, shape, outliers, etc.

(Monash) FIT5196 18 / 24

Exploratory Data Analysis

Univariate graphical methods

Histograms: a quick way of learning the characteristics of your data,
including central tendency, spread, shape, outliers, etc.

(Monash) FIT5196 18 / 24

Exploratory Data Analysis

Univariate graphical methods

Boxplots (or Box-and-Whiskers Plot): display five-point summaries and
potential outliers in graphical form

(Monash) FIT5196 18 / 24

Exploratory Data Analysis

Univariate graphical methods

Boxplots (or Box-and-Whiskers Plot): display five-point summaries and
potential outliers in graphical form

(Monash) FIT5196 18 / 24

Exploratory Data Analysis

Multivariable non-graphical methods: Categorical data

Cross-tabulation: a two-way table with column headings that match the
levels of one variable and row headings that match the levels of the other
variable, then filling in the counts of all subjects that share a pair of levels.

(Monash) FIT5196 19 / 24

Exploratory Data Analysis

Multivariable non-graphical methods: Quantitative variables

Covariance: measures how much two variables “co-vary”, i.e., how much
(and in what direction) should we expect one variable to change when the
other changes.

Cov(X ,Y ) =

∑n
i=1(xi − x̄)(yi − ȳ)

n − 1
Note Cov(X ,X ) = Var(X ).

Correlation:

Cor(X ,Y ) =
Cov(X ,Y )
σxσy

É value range between -1 and +1,
É -1 being a perfect negative linear correlation,
É +1 being a perfect positive linear correlation,
É and 0 indicating that X and Y are uncorrelated.

(Monash) FIT5196 20 / 24

Exploratory Data Analysis

Multivariable non-graphical methods: correlation matrix

(Monash) FIT5196 21 / 24

Exploratory Data Analysis

Multivariable graphical methods

Side-by-Side boxplot

(Monash) FIT5196 22 / 24

Exploratory Data Analysis

Multivariable graphical methods

Side-by-Side boxplot

This figure is from “R and Data Mining: Examples and Case Studies”

(Monash) FIT5196 22 / 24

Exploratory Data Analysis

Multivariable graphical methods

Scatterplot

This figure is from “R and Data Mining: Examples and Case Studies”

(Monash) FIT5196 23 / 24

Exploratory Data Analysis

Multivariable graphical methods

Scatterplot

This figure is from https://onlinecourses.science.psu.edu/stat100/node/36

(Monash) FIT5196 23 / 24

Summary

1 What we discussed
É Data anomalies
É Exploratory Data Analysis

2 Python Plot Tutorial:
É http://pandas.pydata.org/pandas-docs/stable/visualization.html
É http://matplotlib.org/users/tutorials.html

3 Attend tutorial for week 6 in next week
4 Assessment 1 is due this week (2 Sept)!

(Monash) FIT5196 24 / 24

Data Anomalies
Exploratory Data Analysis
Summary

Related Posts