9/7/2021
1
Introduction to Data Science
Lecture 6
Dirty Data,
Essentials of Data Cleaning and Integration
CIS 5930/4930 – Fall 2021
Dirty Data
CIS 5930 – Fall 2020
Dirty Data
CIS 5930 – Fall 2020
Dirty Data
CIS 5930 – Fall 2020
Dirty Data
CIS 5930 – Fall 2020
Dirty Data
CIS 5930 – Fall 2020
1 2
3 4
5 6
9/7/2021
2
Dirty Data
CIS 5930 – Fall 2020
Variety of Dirty Data
CIS 5930 – Fall 2020
• Naming conventions (RI, Rhode Island)
• Missing values (NULL)
• Typos (Providnce)
• Duplicates
CIS 5930 – Fall 2020
• Parsing input data (e.g., separator issues)
• Formatting issues – esp. dates
• Missing values and required fields (e.g., always
use 0)
• Different representations (2 vs Two)
• Fields too long (get truncated)
• Primary key violations (from data merging)
• Redundant Records (from data merging)
Variety of Dirty Data Why?
CIS 5930 – Fall 2020
• Data is dirty on its own
• Data sets are clean on their own but combining
them introduces errors (e.g. duplicates, different
naming conventions)
• Data doesn’t “age well” (inflation, restricting)
• Any combination of the above
Data Cleaning
CIS 5930 – Fall 2020
• Look at your data!
• Examine critically your query results
• Set (sensible) defaults
• Remove outliers
• Machine learn some of the things
Examining Query Results
CIS 5930 – Fall 2020
7 8
9 10
11 12
9/7/2021
3
Outliers
CIS 5930 – Fall 2020
String Similarity
CIS 5930 – Fall 2020
• Edit distance
• Minimal number of edits (insertions,
deletions, substitutions) to transform
string A into string B
https://en.wikipedia.org/wiki/Levenshtein_distance
Edit Distance
CIS 5930 – Fall 2020
Edit Distance
CIS 5930 – Fall 2020
String Similarity
CIS 5930 – Fall 2020
• Jaccard similarity
https://en.wikipedia.org/wiki/Jaccard_index
Jaccard Similarity
CIS 5930 – Fall 2020
13 14
15 16
17 18
9/7/2021
4
Jaccard Similarity
CIS 5930 – Fall 2020
Jaccard Similarity
CIS 5930 – Fall 2020
Data Integration
1. Enterprise Information Integration:
making separate DB’s, all owned by one
company, work together.
2. Scientific DB’s, e.g., genome DB’s.
3. Catalog integration: combining product
information from all your suppliers.
Challenges
1. DB’s get used for many applications.
You can’t change its structure for the sake of one
application, because it will cause others to
break.
2. Incompatibilities : Two, supposedly similar
databases, will mismatch in many ways.
Examples: Incompatibilities
• Lexical : addr in one DB is address in
another.
• Value mismatches : Is 20 degrees Fahrenheit
or Centigrade?
• Semantic : are “employees” in each database
the same? What about consultants?
Retirees? Contractors?
What Do You Do About It?
• Handwritten translation at each interface
• Wrapper (aka “adapter”) translates incoming
queries and outgoing answers.
19 20
21 22
23 24
9/7/2021
5
Integration Architectures
1. Federation : everybody talks directly to
everyone else.
2. Warehouse : Sources are translated from
their local schema to a global schema and
copied to a central DB.
3. Mediator : Virtual warehouse – turns a
user query into a sequence of source
queries.
Federations
Wrapper
Wrapper
Wrapper
Wrapper
Wrapper
Wrapper
Warehouse Diagram
Warehouse
Wrapper Wrapper
Source 1 Source 2
A Mediator
Mediator
Wrapper Wrapper
Source 1 Source 2
User query
Query
Query
QueryQuery
Result
Result
Result
Result
Result
25 26
27 28