CS计算机代考程序代写 data science database 9/7/2021

9/7/2021

1

Introduction to Data Science

Lecture 6
Dirty Data,

Essentials of Data Cleaning and Integration

CIS 5930/4930 – Fall 2021

Dirty Data

CIS 5930 – Fall 2020

Dirty Data

CIS 5930 – Fall 2020

Dirty Data

CIS 5930 – Fall 2020

Dirty Data

CIS 5930 – Fall 2020

Dirty Data

CIS 5930 – Fall 2020

1 2

3 4

5 6

9/7/2021

2

Dirty Data

CIS 5930 – Fall 2020

Variety of Dirty Data

CIS 5930 – Fall 2020

• Naming conventions (RI, Rhode Island)
• Missing values (NULL)
• Typos (Providnce)
• Duplicates

CIS 5930 – Fall 2020

• Parsing input data (e.g., separator issues)
• Formatting issues – esp. dates
• Missing values and required fields (e.g., always
use 0)
• Different representations (2 vs Two)
• Fields too long (get truncated)
• Primary key violations (from data merging)
• Redundant Records (from data merging)

Variety of Dirty Data Why?

CIS 5930 – Fall 2020

• Data is dirty on its own
• Data sets are clean on their own but combining
them introduces errors (e.g. duplicates, different
naming conventions)
• Data doesn’t “age well” (inflation, restricting)
• Any combination of the above

Data Cleaning

CIS 5930 – Fall 2020

• Look at your data!
• Examine critically your query results
• Set (sensible) defaults
• Remove outliers
• Machine learn some of the things

Examining Query Results

CIS 5930 – Fall 2020

7 8

9 10

11 12

9/7/2021

3

Outliers

CIS 5930 – Fall 2020

String Similarity

CIS 5930 – Fall 2020

• Edit distance
• Minimal number of edits (insertions,

deletions, substitutions) to transform
string A into string B

https://en.wikipedia.org/wiki/Levenshtein_distance

Edit Distance

CIS 5930 – Fall 2020

Edit Distance

CIS 5930 – Fall 2020

String Similarity

CIS 5930 – Fall 2020

• Jaccard similarity

https://en.wikipedia.org/wiki/Jaccard_index

Jaccard Similarity

CIS 5930 – Fall 2020

13 14

15 16

17 18

9/7/2021

4

Jaccard Similarity

CIS 5930 – Fall 2020

Jaccard Similarity

CIS 5930 – Fall 2020

Data Integration

1. Enterprise Information Integration:
making separate DB’s, all owned by one
company, work together.

2. Scientific DB’s, e.g., genome DB’s.

3. Catalog integration: combining product
information from all your suppliers.

Challenges

1. DB’s get used for many applications.

You can’t change its structure for the sake of one
application, because it will cause others to
break.

2. Incompatibilities : Two, supposedly similar
databases, will mismatch in many ways.

Examples: Incompatibilities

• Lexical : addr in one DB is address in
another.

• Value mismatches : Is 20 degrees Fahrenheit
or Centigrade?

• Semantic : are “employees” in each database
the same? What about consultants?
Retirees? Contractors?

What Do You Do About It?

• Handwritten translation at each interface

• Wrapper (aka “adapter”) translates incoming
queries and outgoing answers.

19 20

21 22

23 24

9/7/2021

5

Integration Architectures

1. Federation : everybody talks directly to
everyone else.

2. Warehouse : Sources are translated from
their local schema to a global schema and
copied to a central DB.

3. Mediator : Virtual warehouse – turns a
user query into a sequence of source
queries.

Federations

Wrapper

Wrapper

Wrapper

Wrapper

Wrapper

Wrapper

Warehouse Diagram

Warehouse

Wrapper Wrapper

Source 1 Source 2

A Mediator

Mediator

Wrapper Wrapper

Source 1 Source 2

User query

Query

Query

QueryQuery

Result

Result

Result

Result

Result

25 26

27 28