Data Cleansing — 1
Data Cleansing — 1
Faculty of Information Technology, Monash University, Australia
FIT5196 week 6
(Monash) FIT5196 1 / 24
Data Wrangling Process
(Monash) FIT5196 2 / 24
Outline
1 Data Anomalies
2 Exploratory Data Analysis
3 Summary
(Monash) FIT5196 3 / 24
Data Anomalies
Data Cleansing
Data Cleansing: A process of detecting and removing errors and
inconsistencies from data in order to improve the quality of data.
(Monash) FIT5196 4 / 24
Data Anomalies
Data Cleansing
Data Cleansing: A process of detecting and removing errors and
inconsistencies from data in order to improve the quality of data.
Data Cleansing: An iterative process
Data Glitches: Monsters in Your Data 165
Define
&
Identify
Detect
&
Quantify
Clean
&
Rectify
Measure
&
Verify
-Types of Glitches
-Identify those that impact the User’s
application
-Methods and algorithms for detection
-Quantify prevalence
-Develop suitable repair strategies
-Repair the data
-Measure improvement in quality
-Verify data meet tolerance, cost
and bias criteria
Iterate
Fig. 1 Overview of data quality process: the process is flexible to allow the user to customize data
quality assessment and repair to suit resource and accuracy requirements
between cleaning and distortion, within the user’s resource limitations and quality
specifications. In this chapter, we provide a brief introduction to these concepts and
explain data quality assessment within the framework of statistical theory.
The rest of the chapter is organized as follows. Section 2 provides an overview
of the statistical data quality process, emphasizing its iterative nature. In Sect. 3,
we discuss new types of glitches and complex glitch patterns that can be leveraged
to develop efficient context-dependent cleaning strategies. In Sect. 4, we describe
methods for detecting different types of glitches, while Sect. 5 focuses on assessing
the quality of data. Section 6 offers suggestions for cleaning and repairing data.
In Sect. 7, we discuss how to choose a strategy from a multitude of potential
strategies, based on the notion of the trade-off between cleaning and distortion. A
brief literature overview is provided in Sect. 8. Finally, we present our conclusions
in Sect. 9.
2 Data Cleaning, an Iterative Process
Data cleaning is an iterative process. Typically, a user is presented with a data set
D and would like to clean it to meet cleanliness specifications (“no more than
10 % missing values”) subject to resource and cost considerations. The process
of cleaning could potentially introduce new glitches and make the data dirtier.
Therefore, after each cleaning step, the user remeasures the quality and repeats the
process until the specifications are satisfied.
A schematic version of the data quality process is shown in Fig. 1. It has four
broad stages. We give a brief overview below and describe each stage in detail in
the following sections.
An overview of data quality process, from “Data Glitches: Monsters in Your Data” by Tamraparni Dasu, in “Handbook of Data Quality” 2013
(Monash) FIT5196 4 / 24
Data Anomalies
Data Cleansing
Data Cleansing: A process of detecting and removing errors and
inconsistencies from data in order to improve the quality of data.
Data Cleansing: An iterative process
(Monash) FIT5196 4 / 24
Data Anomalies
Data Anomalies (i.e., Glitches or Errors)
Data Anomalies describes the distortion of the data because of any of the
problems that might encounter in the life cycle of data that includes its
capture, storage, update, transmission, access, archive, restore, deletion and
purge.
Some common data quality problems
É Missing data
É Inconsistent and faulty data
É Outliers
É Duplicates
(Monash) FIT5196 5 / 24
Data Anomalies
Data Anomaly Classification: source-based1
1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do
(Monash) FIT5196 6 / 24
Data Anomalies
Data Anomaly Classification: source-based1
1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do
(Monash) FIT5196 6 / 24
Data Anomalies
Data Anomaly Classification: source-based1
1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do
(Monash) FIT5196 6 / 24
Data Anomalies
Data Anomaly Classification: source-based1
1
From “Data Cleaning: Problems and Current Approaches” by Rahm and Do
(Monash) FIT5196 6 / 24
Data Anomalies
Data Anomalies Classification: type-based
Syntactical Anomalies: format and values
Semantic Anomalies: comprehensiveness and non-redundancy
Coverage Anomalies: missing values
(Monash) FIT5196 7 / 24
Data Anomalies
Data Anomalies Classification: type-based
Syntactical Anomalies: format and values
É Lexical errors: data format discrepancies in terms of database; spelling
errors, typos in terms of linguistics.
É Domain format errors: inconsistent value format of an attribute, e.g.,
Buntine, Wray Lindsay v.s. Wray L. Buntine
É Irregularities: the non-uniform use of values, units and abbreviations?e.g.,
salary in difference currencies.
Semantic Anomalies: comprehensiveness and non-redundancy
Coverage Anomalies: missing values
(Monash) FIT5196 7 / 24
Data Anomalies
Data Anomalies Classification: type-based
Syntactical Anomalies: format and values
Semantic Anomalies: comprehensiveness and non-redundancy
É Integrity constraint violations
É Contradictions: violation of dependencies between attributes, e.g., AGE and
DOB.
É Duplicates: observations representing the same entity.
É Invalid observations
Coverage Anomalies: missing values
(Monash) FIT5196 7 / 24
Data Anomalies
Data Anomalies Classification: type-based
Syntactical Anomalies: format and values
Semantic Anomalies: comprehensiveness and non-redundancy
Coverage Anomalies: missing values
É Missing values: due to omissions while collecting the data
É Missing observations:
(Monash) FIT5196 7 / 24
Data Anomalies
Taxonomy of Dirty Data2
Dirty data manifests itself in three different ways:
missing data
not missing but wrong data,
not missing and not wrong but unusable,
2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24
Data Anomalies
Taxonomy of Dirty Data2
Dirty data manifests itself in three different ways:
missing data
É Missing data where there is no Null-not-allowed constraint
É Missing data where Null-not-allowed constraint should be enforced
not missing but wrong data,
not missing and not wrong but unusable,
2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24
Data Anomalies
Taxonomy of Dirty Data2
Dirty data manifests itself in three different ways:
missing data
not missing but wrong data, due to
É Integrity constraints
− violation of data type constraint, including value range
− violation of non-null uniqueness constraint, i.e., duplicated data
− violation of referential integrity
− Wrong categorical data
− Outdated temporal data
− Inconsistent spatial data
É Data Entry error involving a single table
− Data entry error involving a single field: erroneous entry, misspelling,
extraneous data
− Data entry error involving multiple fields: entry into wrong fields, wrong
derived-field data
not missing and not wrong but unusable,
2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24
Data Anomalies
Taxonomy of Dirty Data2
Dirty data manifests itself in three different ways:
missing data
not missing but wrong data,
not missing and not wrong but unusable, due to
É Different data for the same entity across multiple databases
É Ambiguous data due to: the use of abbreviation (Dr. for doctor or drive),
É Incomplete context (e.g., Sydney of Australia or Canada)
É The use of abbreviation (e.g., ste for suite, rd for road, st for street, etc)
É Alias/nick name (e.g., Bill Clinton, President Clinton)
É Encoding formats (e.g, ASCII, …)
É Representations (e.g., negative number, precision, fraction)
É Measurement units (e.g., data, time, currency, weight, area, etc.)
É Uses of special characters (e.g., space, dash, parenthesis in phone numbers)
in concatenated data
2Kim et al, “A Taxonomy of Dirty Data”, DMKD 2003
(Monash) FIT5196 8 / 24
Data Anomalies
Quality measures
Adapted from “Problems, Methods, and Challenges in Comprehensive Data Cleansing” By Muller and Fretag
(Monash) FIT5196 9 / 24
Data Anomalies
Data Anomalies: Looking for Errors
ID Landgrabbed ISO Landgrabber Base Sector Hectares Production Projected investment
Status of
deal Start End
A23
Algeria DZA Al Qudra UAE Finance 31000.00 Milk, olive oil,
potatoes
Done
06/2005 01/2015
A45
Algeria Al Qudra UAE real estate 31000.00 Milk, olive oil,
potatoes
Done
06/1905 01/2012
A3
Angola CAMC Engineering
Co. Ltd
China Construction 1500.00 Rice US$77 million Done
06/2010 05/2005
A1
Philippines Kuwait Kuwait Government 20000 Maize, rice In process
10/2015 12/1917
A34
菲律宾 Zuellig Group Malaysia Agribusiness,
health care
30000 Maize In process
06/2016 08/2020
A45
Philippines Oman Oman Government 10,000 Rice 150m Processing
06/1909 09/1917
A34
Philippines PHL Brunei Investment
Authority
Brunei Gover 10,000 Rice Proposed
03/2016
A56 Philippines China China 100,280,000 Various Suspended 02/2000 11/2001
A54
Philippines Green Future
Innovation
Japan 11,000 Sugar cane US$120 million Done
06/2014 09/2015
A4
argantin ARG Beidahuang CH 320000 Maize, soybeans, wheat US$1,500
million
Suspended
12/1900 07/1901
A65
Tanzânia
Nirmal Seeds
Indië
Agribusiness 30000 Seeds In process
03/2013 06/2016
tanzania Yes Bank India Finance 50000 Rice, wheat In process
06/2010 06/2017
A3
Tansania Export Trading
Group
Singapore Agribusiness 8000 Rice Done
12/2015 10/2018 Syntactical Typos Irreglarities
A23
Brazil BRA Clean Energy
Brazil
UK 30,000 Sugar cane Done
03/2012 09/2013 Coverage
N67
BRA Adecoagro US Agribusiness 165,000 Cattle, cofee, grains,
soybeans, sugar cane
98,000,000 Done
10/2010 07/2005
A67
brazil BRA Archer Daniels
Midland
US Agribusiness 12,000 Oil palm In process
06/2014 01/2015
A56
Brasil Black River Asset
Management
United
states
Finance 50,000 Crops 20000000 Done
02/2010 2015
(Monash) FIT5196 10 / 24
Data Anomalies
Data Anomalies: Some Errors
ID Landgrabbed ISO Landgrabber Base Sector Hectares Production Projected investment Status of deal Start End
A23 Algeria DZA Al Qudra UAE Finance 31000.00
Milk, olive oil,
potatoes Done 06/2005 01/2015
A45 Algeria Al Qudra UAE real estate 31000.00
Milk, olive oil,
potatoes Done 06/1905 01/2012
A3 Angola
CAMC Engineering
Co. Ltd China Construction 1500.00 Rice US$77 million Done 06/2010 05/2005
A1 Philippines Kuwait Kuwait Government 20000 Maize, rice In process 10/2015 12/1917
A34 菲律宾 Zuellig Group Malaysia
Agribusiness,
health care 30000 Maize In process 06/2016
A45 Philippines Oman Oman Government 10,000 Rice 150m Processing 06/1909 09/1917
A34 Philippines PHL
Brunei Investment
Authority Brunei Gover 10,000 Rice Proposed 03/2016
A56 Philippines China China 100,280,000 Various Suspended 02/2000 11/2001
A54 Philippines Green Future
Innovation
Japan 11,000 Sugar cane US$120 million Done 06/2014 09/2015
A4 argantin ARG Beidahuang CH 320000
Maize, soybeans,
wheat
US$1,500
million Suspended 12/1900 07/1901 What does it affect
A65 Tanzânia Nirmal Seeds Indië Agribusiness 30000 Seeds In process 03/2013 06/2016
tanzania Yes Bank India Finance 50000 Rice, wheat In process 06/2010
A3 Tansania Export Trading
Group
Singapore Agribusiness 8000 Rice Done 12/2015 10/2018 Systamatic Typos Irreglarities Mandatory fields
A23 Brazil BRA
Clean Energy
Brazil UK 30,000 Sugar cane Done 03/2012 09/2013 Semantic Consistency Integrity constraint validation redundanct outliers
N67 BRA Adecoagro US Agribusiness 165,000
Cattle, cofee,
grains, soybeans,
sugar cane
98,000,000 Done 10/2010 07/2005
Coverage
A67 brazil In process 06/2014
A56 Brasil
Black River Asset
Management
United
states Finance 50,000 Crops 20000000 Done 02/2010 2015
Some problems:
1- Consistency in country names Integrity Validity Cross validation & Typos
2- ID: Consistency and duplication completeness Covergae
3- Irrigularities in contry names consistency Schema confrormance Mandatory fields
4- Date: End before start Unifromity irregualrity
5- Date: End 1917 while processing density Coverage
6- Done in 2018
7- What kind of missing value in
8- Consistency in status of the deal
9- Kind of missing values on End data If it is in process, the end date is missing : MAR or MCAR or MNAR
This one is MNAR as it is missing mostly when the status is in process. The in process is an indictor in this case not a dependent variable.
10- Outlier in Hectars 100,2180,000
11- Consistency in units in projected inverstments
12- Msiing in A6 (second last). What is the best to do?
13- Base columns, abbriviation or full name?
14…. more
(Monash) FIT5196 11 / 24
Data Anomalies
Data Cleansing Blueprint
1 Data Auditing (or Analysis): detect errors and inconsistencies in the data
É Data profiling: focuses on the instance analysis of individual attributes
É Data mining: descriptive data mining, e.g., clustering, summarisation,
association discovery, etc.
2 Definition of transformation workflow: define a sequence of operations on
the data, used to detect and eliminate anomalies
É Early data cleaning steps: correct single-source instance problems
É Later data cleaning steps: deal with schema/data integration and clean
multi-source problems.
3 Verification: test and evaluate the correctness and effectiveness of a
transformation workflow
4 Data transformation: Execute the transformation steps
5 Post-processing and controlling: inspect the results to verify the correctness
of the specified operations.
(Monash) FIT5196 12 / 24
Exploratory Data Analysis
Outline
1 Data Anomalies
2 Exploratory Data Analysis
3 Summary
(Monash) FIT5196 13 / 24
Exploratory Data Analysis
Exploratory Data Analysis
Two types of variables:
É categorial variable
É numerical variable
Two types of EDA
É Non-graphical: summary statistics
É Graphical: various plots
EDA
É Univariate
É Multivariate
(Monash) FIT5196 14 / 24
Exploratory Data Analysis
Univariate non-graphical methods: Categorial data
Categorical variables: values or observations that can be sorted into groups
or categories.
É Examples: Sex, Eye colour and blood type.
The characteristics of interest for a categorial variable
É the range of values
É the frequency of occurrence for each value
É univariate non-graphical technique: tabulation of the frequencies
(Monash) FIT5196 15 / 24
Exploratory Data Analysis
Univariate non-graphical methods: Quantitative data
Numerical variables: values or observations that can be measured, and these
numerical values can be placed in ascending or descending order.
É Examples: salary, height, weight, etc.
The characteristics of the population distribution of a numerical variable
É center tendency: “location” of a distribution, dealing with typical or middle
values.
É spread: an indicator of how far away from the centre we are still likely to find
data values.
É shape: Skewness and Kurtosis
É outliers: values that are outside of the areas of a distribution that would
commonly occur.
(Monash) FIT5196 16 / 24
Exploratory Data Analysis
Univariate non-graphical methods: Quantitative data
The characteristics of the population distribution of a numerical variable
É center tendency
− Mean: the arithmetic average of a set of values
µ =
x1 + x2 + · · ·+ xn
n
− Median: the middle value after all the values are put in an order list.
− Mode: the most frequent occurring value in a set of values
(Monash) FIT5196 17 / 24
Exploratory Data Analysis
Univariate non-graphical methods: Quantitative data
The characteristics of the population distribution of a numerical variable
É center tendency
This figure is from “Summary Statistics”
(Monash) FIT5196 17 / 24
Exploratory Data Analysis
Univariate non-graphical methods: Quantitative data
The characteristics of the population distribution of a numerical variable
É spread
− Range: the difference between the smallest and largest values in the data set.
− Standard Deviation and Variance
σ2 =
∑n
i=1(xi − x̄)
2
n − 1
This figure is from Wikipedia
(Monash) FIT5196 17 / 24
Exploratory Data Analysis
Univariate non-graphical methods: Quantitative data
The characteristics of the population distribution of a numerical variable
É spread
− Range: the difference between the smallest and largest values in the data set.
− Standard Deviation and Variance
σ2 =
∑n
i=1(xi − x̄)
2
n − 1
− The Interquartile Range:
Q0 the minimum
Q1 bigger than 25% of the data points
Q2 the median
Q3 bigger than 75% of the data points
Q4 the maximum
− The inter-quartile range (IQR):
IQR = Q3−Q1
(Monash) FIT5196 17 / 24
Exploratory Data Analysis
Univariate non-graphical methods: Quantitative data
The characteristics of the population distribution of a numerical variable
É The output of Pandas describe() function
(Monash) FIT5196 17 / 24
Exploratory Data Analysis
Univariate graphical methods
Histograms: a quick way of learning the characteristics of your data,
including central tendency, spread, shape, outliers, etc.
Boxplots (or Box-and-Whiskers Plot): display five-point summaries and
potential outliers in graphical form
(Monash) FIT5196 18 / 24
Exploratory Data Analysis
Univariate graphical methods
Histograms: a quick way of learning the characteristics of your data,
including central tendency, spread, shape, outliers, etc.
(Monash) FIT5196 18 / 24
Exploratory Data Analysis
Univariate graphical methods
Histograms: a quick way of learning the characteristics of your data,
including central tendency, spread, shape, outliers, etc.
(Monash) FIT5196 18 / 24
Exploratory Data Analysis
Univariate graphical methods
Boxplots (or Box-and-Whiskers Plot): display five-point summaries and
potential outliers in graphical form
(Monash) FIT5196 18 / 24
Exploratory Data Analysis
Univariate graphical methods
Boxplots (or Box-and-Whiskers Plot): display five-point summaries and
potential outliers in graphical form
(Monash) FIT5196 18 / 24
Exploratory Data Analysis
Multivariable non-graphical methods: Categorical data
Cross-tabulation: a two-way table with column headings that match the
levels of one variable and row headings that match the levels of the other
variable, then filling in the counts of all subjects that share a pair of levels.
(Monash) FIT5196 19 / 24
Exploratory Data Analysis
Multivariable non-graphical methods: Quantitative variables
Covariance: measures how much two variables “co-vary”, i.e., how much
(and in what direction) should we expect one variable to change when the
other changes.
Cov(X ,Y ) =
∑n
i=1(xi − x̄)(yi − ȳ)
n − 1
Note Cov(X ,X ) = Var(X ).
Correlation:
Cor(X ,Y ) =
Cov(X ,Y )
σxσy
É value range between -1 and +1,
É -1 being a perfect negative linear correlation,
É +1 being a perfect positive linear correlation,
É and 0 indicating that X and Y are uncorrelated.
(Monash) FIT5196 20 / 24
Exploratory Data Analysis
Multivariable non-graphical methods: correlation matrix
(Monash) FIT5196 21 / 24
Exploratory Data Analysis
Multivariable graphical methods
Side-by-Side boxplot
(Monash) FIT5196 22 / 24
Exploratory Data Analysis
Multivariable graphical methods
Side-by-Side boxplot
This figure is from “R and Data Mining: Examples and Case Studies”
(Monash) FIT5196 22 / 24
Exploratory Data Analysis
Multivariable graphical methods
Scatterplot
This figure is from “R and Data Mining: Examples and Case Studies”
(Monash) FIT5196 23 / 24
Exploratory Data Analysis
Multivariable graphical methods
Scatterplot
This figure is from https://onlinecourses.science.psu.edu/stat100/node/36
(Monash) FIT5196 23 / 24
Summary
Summary
1 What we discussed
É Data anomalies
É Exploratory Data Analysis
2 Python Plot Tutorial:
É http://pandas.pydata.org/pandas-docs/stable/visualization.html
É http://matplotlib.org/users/tutorials.html
3 Attend tutorial for week 6 in next week
4 Assessment 1 is due this week (2 Sept)!
(Monash) FIT5196 24 / 24
Data Anomalies
Exploratory Data Analysis
Summary