Semester 2, 2021
Lecture 5: Data quality and pre-processing
Why is pre-processing needed?
[Slide table of example records: Date of Birth entered as “20 years ago” or “13th Feb. 2019”; a name entered as “Mike___Moore”]
Measuring data quality
• Accuracy
  • Correct or wrong, accurate or not
• Completeness
  • Not recorded, unavailable
• Consistency
  • E.g. discrepancies in representation
• Timeliness
  • Updated in a timely way
• Believability
  • Do I trust the data is true?
• Interpretability
  • How easily can I understand the data?
Inconsistent data
• Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
• Different date formats (“3/4/2016” versus “3rd April 2016”)
• Age=20, Birthdate=“1/1/1971”
• Two students with the same student id
• Outliers (e.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999)
  • No good if it is a list of ages of hospital patients
  • Might be OK for a listing of people's number of contacts on LinkedIn, though
• Can use automated techniques, but also need domain knowledge
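One possible automated technique for the outlier example above is the interquartile-range (IQR) rule. The sketch below is an illustration only, not a method prescribed by the lecture: the function name `iqr_outliers` and the multiplier `k=1.5` are assumptions. Whether a flagged value is really an error still depends on the domain (ages versus contact counts).

```python
import statistics

# The example values from the slide.
values = [62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999]

def iqr_outliers(data, k=1.5):
    """Return values lying more than k * IQR below Q1 or above Q3.

    k=1.5 is a common convention, chosen here as an assumption.
    """
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [x for x in data if x < low or x > high]

flagged = iqr_outliers(values)
print(flagged)  # [999]
```

The rule only flags candidates; deciding whether 999 is an error or a legitimate value is the domain-knowledge step.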
Missing or incomplete data
• Lacking feature values
  • Name=“”
  • Age=null
• Some types of missing data (Rubin 1976)
  • Missing completely at random
  • Missing at random
  • Missing not at random
Missing completely at random (MCAR)
Missing data are MCAR when the probability of missing data on a variable is unrelated to any other measured variable and is unrelated to the variable with missing values itself.
Missing at random (MAR)
Missing data are MAR when the probability of missing data on a variable is related to other fully measured variables, but not to the values of the variable itself once those variables are controlled for.
Missing not at random (MNAR)
Data are MNAR when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.
For example, data on IQ are MNAR when only the people with low IQ values have missing observations for this variable.
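The three mechanisms can be illustrated with a small simulation. This is a sketch under stated assumptions: the data are synthetic, and the `age` variable, the missingness probabilities and the IQ threshold of 90 are all made up for illustration. In this toy setup IQ is generated independently of age, so only the MNAR mechanism shifts the mean of the observed IQ values.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is repeatable
n = 10_000
age = [random.randint(18, 80) for _ in range(n)]   # fully measured variable
iq = [random.gauss(100, 15) for _ in range(n)]     # variable with missing values

# Each mask holds True where the IQ value is treated as missing.
# MCAR: the same probability for everyone.
mcar = [random.random() < 0.3 for _ in range(n)]
# MAR: the probability depends only on age, which is fully observed.
mar = [random.random() < (0.5 if a > 50 else 0.1) for a in age]
# MNAR: the probability depends on the IQ value itself.
mnar = [random.random() < (0.6 if v < 90 else 0.05) for v in iq]

def observed_mean(values, missing):
    """Mean of the values that are not masked as missing."""
    return statistics.fmean([v for v, m in zip(values, missing) if not m])

true_mean = statistics.fmean(iq)
mean_mcar = observed_mean(iq, mcar)
mean_mar = observed_mean(iq, mar)
mean_mnar = observed_mean(iq, mnar)

print(f"true {true_mean:.1f}  MCAR {mean_mcar:.1f}  "
      f"MAR {mean_mar:.1f}  MNAR {mean_mnar:.1f}")
```

Under MNAR the low IQ values are preferentially dropped, so the observed mean is biased upwards.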
Too much missing data!
Simple Missing-Data Strategies
Simple strategies to retain all data instances/records
• Statistical Measurement Imputation
  • Mean
  • Median
  • Mode
Simple Strategies with sklearn
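The code from the original slide is not preserved in this text. A minimal sketch of mean, median and mode imputation with scikit-learn's `SimpleImputer` might look as follows; the small array `X` (an age column and an income column) is made-up example data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: columns are age and income, np.nan marks missing values.
X = np.array([
    [20.0, 50000.0],
    [np.nan, 62000.0],
    [35.0, np.nan],
    [35.0, 58000.0],
])

# One imputer per strategy; each fills a column's gaps with that column's statistic.
X_mean = SimpleImputer(missing_values=np.nan, strategy="mean").fit_transform(X)
X_median = SimpleImputer(missing_values=np.nan, strategy="median").fit_transform(X)
X_mode = SimpleImputer(missing_values=np.nan, strategy="most_frequent").fit_transform(X)

print(X_mean[1, 0])    # missing age filled with the column mean
print(X_median[1, 0])  # missing age filled with the column median
print(X_mode[1, 0])    # missing age filled with the most frequent value
```

In practice the imputer would be fitted on training data only and then applied to test data with `transform`.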
Simple Missing-Data Strategies
Pro: easy to compute and administer; no loss of records
Con: biases other statistical measurements
• variance
• standard deviation
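The bias on variance and standard deviation can be seen in a small worked example. This is a sketch with made-up ages, where `None` marks a missing value: filling the gaps with the mean leaves the mean unchanged but adds points with zero deviation, so the spread shrinks.

```python
import statistics

# Hypothetical ages; None marks a missing value.
ages = [62, 72, None, 75, 78, None, 80, 82, None, 84, 86, None, 87, 75, None]

observed = [a for a in ages if a is not None]
mean_age = statistics.fmean(observed)

# Mean imputation: replace every missing value with the observed mean.
imputed = [mean_age if a is None else a for a in ages]

sd_before = statistics.stdev(observed)
sd_after = statistics.stdev(imputed)

print(f"mean before {mean_age:.2f}, after {statistics.fmean(imputed):.2f}")
print(f"std dev before {sd_before:.2f}, after {sd_after:.2f}")
```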
Multivariate Strategies
Multivariate (more than one variable)
1. Logical rules
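As an illustration of a logical rule, a missing age can be derived from a recorded date of birth, and the same rule can flag records where the two disagree (such as Age=20 with Birthdate 1/1/1971). This is only a sketch: the record layout, the `note` field and the reference date are assumptions made for the example.

```python
from datetime import date

REFERENCE_DATE = date(2021, 8, 1)  # hypothetical "today" for the example

def age_on(dob, ref=REFERENCE_DATE):
    """Age in whole years on the reference date."""
    had_birthday = (ref.month, ref.day) >= (dob.month, dob.day)
    return ref.year - dob.year - (0 if had_birthday else 1)

# Hypothetical records; age None means the value is missing.
records = [
    {"id": 1, "dob": date(1971, 1, 1), "age": 20},
    {"id": 2, "dob": date(2001, 3, 15), "age": None},
    {"id": 3, "dob": date(1990, 12, 25), "age": 30},
]

for r in records:
    derived = age_on(r["dob"])
    if r["age"] is None:
        r["age"] = derived          # rule fills the missing value
        r["note"] = "imputed"
    elif r["age"] != derived:
        r["note"] = "inconsistent"  # rule exposes a contradiction
    else:
        r["note"] = "ok"

for r in records:
    print(r["id"], r["age"], r["note"])
```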
Disguised missing data
• Everyone’s birthday is January 1st?
• Email address is
• Adriaans and Zantinge
• “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
• How to handle
• Look for “unusual” or suspicious values in the dataset, using knowledge about the domain
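One way to look for suspicious values is to check whether a single value is far more common than it plausibly should be, as with the January 1st birthdays above. The sketch below uses made-up birthdates, and the 25% threshold is an arbitrary assumption; a sensible threshold depends on the domain.

```python
from collections import Counter
from datetime import date

# Hypothetical birthdates; many fall on 1 January, a typical default value.
birthdates = [
    date(1990, 1, 1), date(1985, 1, 1), date(1992, 6, 14), date(1978, 1, 1),
    date(2000, 11, 3), date(1995, 1, 1), date(1988, 3, 27), date(1999, 1, 1),
]

def suspicious_days(dates, max_share=0.25):
    """Return (month, day) pairs that account for more than max_share of records."""
    counts = Counter((d.month, d.day) for d in dates)
    n = len(dates)
    return [day for day, c in counts.items() if c / n > max_share]

flagged_days = suspicious_days(birthdates)
print(flagged_days)  # [(1, 1)]
```

A flagged value is only a candidate for disguised missing data; confirming it requires knowing how the data were entered.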
Major data preprocessing activities
(Source: Data Mining: Concepts and Techniques, Han et al. 2012)
Data cleaning – process
• Many tools exist (Google Refine, Kettle, Talend, …)
• Data scrubbing
• Data discrepancy detection
• Data auditing
• ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
• Our emphasis will be to understand some of the methods employed by typical tools
• Domain knowledge is important