COMP20008
Elements of Data Processing
Semester 1, 2021
Lecture 5: Data quality and pre-processing
© University of Melbourne 2021

Why is pre-processing needed?
Name | Age | Date of Birth
“Henry” | 20.2 | 20 years ago
Katherine | Forty-one | 20/11/66
Michelle | 37 | 5/20/79
Oscar@!! | “5” | 13th Feb. 2019
(blank) | 42 | (blank)
Mike___Moore | 669 | (blank)
巴拉克奥巴马 | 52 | 1961年8月4日

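Why is pre-processing needed?
The Date of Birth column alone mixes relative, day-first, month-first, written-out, blank, and non-English representations:
• 20 years ago
• 20/11/66
• 5/20/79
• 13th Feb. 2019
• (blank)
• (blank)
• 1961年8月4日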
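A minimal sketch of how such a column might be normalized, assuming a small hand-picked list of candidate formats. The function name `parse_dob` and the format list are illustrative assumptions, not part of the lecture. Values that match no format are returned as `None` rather than guessed.

```python
import re
from datetime import date, datetime
from typing import Optional

# Candidate formats, tried in order (an assumption for this sketch).
# Day-first is tried before month-first, so an ambiguous value such as
# "3/4/16" would be read as 3 April.
CANDIDATE_FORMATS = [
    "%d/%m/%y",      # 20/11/66
    "%m/%d/%y",      # 5/20/79 (day-first fails: there is no month 20)
    "%d %b. %Y",     # 13 Feb. 2019 (after the ordinal suffix is stripped)
    "%Y年%m月%d日",   # 1961年8月4日
]


def parse_dob(text: Optional[str]) -> Optional[date]:
    """Return a date for a recognised format, or None if nothing matches."""
    if text is None:
        return None
    # Drop ordinal suffixes: "13th" -> "13".
    cleaned = re.sub(r"(\d)(st|nd|rd|th)\b", r"\1", text.strip())
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).date()
        except ValueError:
            continue
    return None  # e.g. "20 years ago" or an empty string


column = ["20 years ago", "20/11/66", "5/20/79", "13th Feb. 2019", "", None, "1961年8月4日"]
parsed = [parse_dob(value) for value in column]
for raw, value in zip(column, parsed):
    print(repr(raw), "->", value)
```

Two-digit years remain ambiguous even after parsing: Python's `%y` maps 00 to 68 onto 2000 to 2068, so "66" comes back as 2066 and would need a domain rule (a birth date cannot lie in the future) to be corrected.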

Measuring data quality
• Accuracy
  • Correct or wrong, accurate or not
• Completeness
  • Not recorded, unavailable
• Consistency
  • E.g. discrepancies in representation
• Timeliness
  • Updated in a timely way
• Believability
  • Do I trust the data is true?
• Interpretability
  • How easily can I understand the data?

Inconsistent data
• Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
• Different date formats (“3/4/2016” versus “3rd April 2016”)
• Age=20, Birthdate=“1/1/1971”
• Two students with the same student id
• Outliers (e.g. 62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999)
  • Implausible if it is a list of ages of hospital patients
  • Might be fine for a list of people's numbers of contacts on LinkedIn, though
• Can use automated techniques, but also need domain knowledge
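As one example of an automated technique, the sketch below applies the 1.5 × IQR rule to the list above. The slide does not name a specific method, so the choice of rule, the function name `iqr_outliers`, and the multiplier `k` are assumptions made for illustration.

```python
from statistics import quantiles

values = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]


def iqr_outliers(data, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(data, n=4)  # first, second, and third quartile
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < lower or x > upper]


flagged = iqr_outliers(values)
print(flagged)
```

The rule only flags candidates. Whether 999 is an error still depends on what the numbers mean, which is where domain knowledge comes in.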

Missing or incomplete data
• Lacking feature values
  • Name=“”
  • Age=null
• Some types of missing data (Rubin 1976)
  • Missing completely at random
  • Missing at random
  • Missing not at random
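A small sketch of how lacking feature values could be counted per field, treating both `None` (null) and the empty string as missing. The sample records, the `MISSING_MARKERS` set, and the helper name `is_missing` are made up for illustration.

```python
# Hypothetical sample records: one empty name, one null age.
records = [
    {"name": "Michelle", "age": 37},
    {"name": "", "age": 42},
    {"name": "Katherine", "age": None},
]

# Values treated as "missing" in this sketch (an assumption; real data
# may also use markers such as "N/A" or -1).
MISSING_MARKERS = {None, ""}


def is_missing(value):
    """True for None and for strings that are empty after stripping spaces."""
    if isinstance(value, str):
        value = value.strip()
    return value in MISSING_MARKERS


missing_counts = {
    field: sum(is_missing(record.get(field)) for record in records)
    for field in ("name", "age")
}
print(missing_counts)
```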

Missing completely at random (MCAR)
Missing data are MCAR when the probability of missing data on a variable is unrelated to any other measured variable and is unrelated to the variable with missing values itself.

Missing at random (MAR)
Missing data are MAR when the probability of missing data on a variable is related to other fully measured variables, but not to the values of the variable itself once those other variables are taken into account.
Gender | Weight
Male | 81
Female | 50
Male | (blank)
Male | (blank)
Female | 66
Male | (blank)
Male | 75
Male | (blank)
Female | (blank)

Missing not at random (MNAR)
Data are MNAR when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.
For example, when data are missing on IQ and only the people with low IQ values have missing observations for this variable.
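The difference between MCAR, MAR, and MNAR can be made concrete with a small simulation. The sketch below uses a synthetic population (all numbers, probabilities, and names are made up for illustration) and compares the mean of the weights that remain observed under each mechanism with the true mean.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Synthetic population: gender is always recorded, weight may go missing.
population = []
for _ in range(10_000):
    gender = random.choice(["Male", "Female"])
    weight = random.gauss(80 if gender == "Male" else 65, 10)
    population.append((gender, weight))


def observed_weights(keep_probability):
    """Keep each weight with probability keep_probability(gender, weight)."""
    return [w for g, w in population if random.random() < keep_probability(g, w)]


true_mean = mean(w for _, w in population)

# MCAR: missingness ignores both gender and weight.
mcar_mean = mean(observed_weights(lambda g, w: 0.5))
# MAR: missingness depends only on the fully observed gender column.
mar_mean = mean(observed_weights(lambda g, w: 0.3 if g == "Male" else 0.9))
# MNAR: missingness depends on the weight itself (low weights rarely recorded).
mnar_mean = mean(observed_weights(lambda g, w: 0.2 if w < 70 else 0.9))

print(f"true {true_mean:.1f}  MCAR {mcar_mean:.1f}  MAR {mar_mean:.1f}  MNAR {mnar_mean:.1f}")
```

Under MCAR the observed mean stays close to the true mean. Under MAR the overall mean shifts, but the shift can be accounted for by looking at each gender separately. Under MNAR the observed values are themselves a biased sample, so the observed data alone cannot reveal the size of the bias.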

Too much missing data!

Simple Missing-Data Strategies
Simple strategies to retain all data instances/records
• Statistical Measurement Imputation
  • Mean
  • Median
  • Mode

Simple Strategies with sklearn
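A minimal sketch of mean, median, and mode imputation with scikit-learn's `SimpleImputer`. The weight values are borrowed from the MAR example, and the variable names (`X`, `fill_values`) are illustrative assumptions rather than the lecture's own code.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One feature (weight) with missing entries encoded as NaN.
X = np.array([[81.0], [50.0], [np.nan], [np.nan], [66.0],
              [np.nan], [75.0], [np.nan], [np.nan]])

fill_values = {}
for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(missing_values=np.nan, strategy=strategy)
    X_filled = imputer.fit_transform(X)          # NaNs replaced column by column
    fill_values[strategy] = imputer.statistics_[0]  # value used for column 0
    assert not np.isnan(X_filled).any()

print(fill_values)
```

`fit_transform` learns the fill value from the observed entries of each column and returns a copy with every NaN replaced, so no records are dropped.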

Simple Missing-Data Strategies
Pros and cons of mean, median, and mode imputation
Pro: easy to compute and administer; no loss of records
Con: biases other statistical measurements
• variance
• standard deviation
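The bias is easy to see on a small example. The sketch below fills the missing weights from the MAR example with the observed mean and compares the spread before and after; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

weights = [81, 50, None, None, 66, None, 75, None, None]

observed = [w for w in weights if w is not None]
fill = mean(observed)  # mean of the four observed weights
imputed = [fill if w is None else w for w in weights]

var_before, var_after = variance(observed), variance(imputed)
sd_before, sd_after = stdev(observed), stdev(imputed)

print(f"mean     {mean(observed)} -> {mean(imputed)}")
print(f"variance {var_before} -> {var_after}")
print(f"std dev  {sd_before:.2f} -> {sd_after:.2f}")
```

The mean is preserved, but every imputed value sits exactly at the mean and adds nothing to the sum of squared deviations while increasing the record count, so variance and standard deviation shrink.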

Multivariate Strategies
Multivariate (more than one variable)
1. Logical rules
2. Model
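A sketch of both ideas, under stated assumptions. The slide names only the two categories, so the specific rule (deriving a missing age from a known date of birth) and the specific model (the mean weight within each gender, about the simplest model that uses a second variable) are illustrative choices, as are all function names and the reference date.

```python
from datetime import date
from statistics import mean


def age_on(birth, today):
    """Logical rule: age in whole years follows from the date of birth."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


def impute_weight_by_gender(rows):
    """Model: replace a missing weight with the mean weight of that gender."""
    groups = {}
    for gender, weight in rows:
        if weight is not None:
            groups.setdefault(gender, []).append(weight)
    group_mean = {g: mean(ws) for g, ws in groups.items()}
    # A gender with no observed weights stays None rather than being guessed.
    return [(g, group_mean.get(g) if w is None else w) for g, w in rows]


# 1. Logical rule, with a hypothetical reference date.
derived_age = age_on(date(1961, 8, 4), date(2021, 3, 1))

# 2. Model, using the gender/weight rows from the MAR example.
rows = [("Male", 81), ("Female", 50), ("Male", None), ("Male", None),
        ("Female", 66), ("Male", None), ("Male", 75), ("Male", None),
        ("Female", None)]
filled = impute_weight_by_gender(rows)
print(derived_age, filled)
```

A regression or nearest-neighbour model would follow the same pattern: fit on the records where the target is observed, then predict it for the records where it is missing.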

Disguised missing data
• Everyone’s birthday is January 1st?
• Email address is xx@xx.com
• Adriaans and Zantinge
  • “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
• How to handle
  • Look for “unusual” or suspicious values in the dataset, using knowledge about the domain
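One simple way to look for suspicious values is to flag any value that accounts for an implausibly large share of a column. The sketch below does this for birthdays; the function name `suspicious_values`, the threshold `min_share`, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def suspicious_values(values, min_share=0.2):
    """Return {value: share} for values making up at least min_share of the column."""
    counts = Counter(values)
    total = len(values)
    return {v: c / total for v, c in counts.items() if c / total >= min_share}


# Hypothetical day-month birthdays; "01-01" is likely a default, not a real value.
birthdays = ["01-01", "14-03", "01-01", "22-07", "01-01",
             "09-11", "01-01", "30-05", "17-02", "01-01"]

flagged = suspicious_values(birthdays)
print(flagged)
```

A sensible threshold depends on the domain: with roughly 365 possible birthdays, any single day covering more than a few percent of records deserves a closer look.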

Major data preprocessing activities
Source: Han et al., Data Mining: Concepts and Techniques, 2012

Data cleaning – process
• Many tools exist (Google Refine, Kettle, Talend, …)
• Data scrubbing
• Data discrepancy detection
• Data auditing
• ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
• Our emphasis will be to understand some of the methods employed by typical tools
• Domain knowledge is important
