Data quality and pre-processing – I
School of Computing and Information Systems
@University of Melbourne 2022
Why is pre-processing needed?
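[Figure: sample records with inconsistent entries, e.g. Date of Birth recorded as “20 years ago” and “13th Feb. 2019”, and a name recorded as “Mike___Moore”]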
COMP20008 Elements of Data Processing
Data quality
Measuring data quality
• Accuracy
  • Correct or wrong, accurate or not
• Completeness
  • Not recorded, unavailable
• Consistency
  • E.g. discrepancies in representation
• Timeliness
  • Updated in a timely way
• Believability
  • Do I trust the data is true?
• Interpretability
  • How easily can I understand the data?
Inconsistent data
• Different naming representations (“Melbourne University” versus “University of Melbourne”) or (“three” versus “3”)
• Different date formats (“3/4/2016” versus “3rd April 2016”)
• Age=20, Birthdate=“1/1/1971”
• Two students with the same student id
• Outliers (e.g. 62,72,75,75,78,80,82,84,86,87,87,89,89,90,999)
  • No good if it is a list of ages of hospital patients
  • Might be OK for a list of people's number of contacts on LinkedIn, though
• Can use automated techniques, but also need domain knowledge
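As one example of an automated technique, the outlier in the list above can be flagged with the interquartile-range (IQR) rule. This is a minimal sketch: the 1.5 × IQR threshold is a common convention assumed here, not something the slides prescribe, and whether a flagged value is really an error still depends on domain knowledge.

```python
import numpy as np

# The example values from the slide
values = np.array([62, 72, 75, 75, 78, 80, 82, 84, 86, 87, 87, 89, 89, 90, 999])

# Quartiles and interquartile range
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1

# Assumed convention: anything beyond 1.5 * IQR from the quartiles is suspicious
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = values[(values < lower) | (values > upper)]

print(outliers)
```

The rule only flags 999 as a candidate. Deciding whether it is wrong (an age) or plausible (a contact count) is the domain-knowledge step.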
Data quality and pre-processing – II
Missing or incomplete data
• Lacking feature values
  • Name=“”
  • Age=null
• Some types of missing data (Rubin 1976)
  • Missing completely at random
  • Missing at random
  • Missing not at random
Missing completely at random (MCAR)
Missing data are MCAR when the probability of missing data on a variable is unrelated to any other measured variable and is unrelated to the variable with missing values itself.
Missing at random (MAR)
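Missing data are MAR when the probability of missing data on a variable is related to other fully measured variables, but not to the values of the variable with missing data itself once those other variables are taken into account.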
Missing not at random (MNAR)
Data are MNAR when the missing values on a variable are related to the values of that variable itself, even after controlling for other variables.
For example, data on IQ are MNAR when only the people with low IQ values have missing observations for this variable.
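The three mechanisms can be contrasted with a small simulation. The sketch below is illustrative only: the `age` and `iq` columns, the sample size, and the missingness probabilities are all assumptions made up for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical fully observed data: age and IQ are generated independently
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "iq": rng.normal(100, 15, size=n),
})

# MCAR: every IQ value has the same 20% chance of being missing
mcar = df["iq"].mask(rng.random(n) < 0.2)

# MAR: missingness depends only on the fully observed variable age
mar = df["iq"].mask((df["age"] > 60) & (rng.random(n) < 0.5))

# MNAR: missingness depends on the (unobserved) IQ value itself
mnar = df["iq"].mask((df["iq"] < 90) & (rng.random(n) < 0.5))

for name, col in [("MCAR", mcar), ("MAR", mar), ("MNAR", mnar)]:
    print(name, "missing share:", round(col.isna().mean(), 3),
          "observed mean:", round(col.mean(), 2))
```

Under MNAR the observed mean is pushed upwards because low IQ values are the ones that go missing. Under MCAR the observed mean stays close to the mean of the complete data.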
Too much missing data!
Simple Missing-Data Strategies
Simple strategies to retain all data instances/records
• Statistical Measurement Imputation
  • Mean
  • Median
  • Mode
Simple Strategies with sklearn
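The original slide content is not available here, so the following is only a sketch of how mean, median and mode imputation might look with scikit-learn's `SimpleImputer`. The small array `X` is made-up example data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up data: two numeric features with missing entries (np.nan)
X = np.array([
    [20.0, 1.0],
    [np.nan, 2.0],
    [30.0, np.nan],
    [40.0, 2.0],
])

# Each strategy fills a missing entry with a statistic of its own column
X_mean = SimpleImputer(strategy="mean").fit_transform(X)
X_median = SimpleImputer(strategy="median").fit_transform(X)
X_mode = SimpleImputer(strategy="most_frequent").fit_transform(X)

print(X_mean)
print(X_median)
print(X_mode)
```

In a modelling pipeline the imputer would normally be fitted on the training data only and then applied to the test data with `transform`.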
Simple Missing-Data Strategies: pros and cons
• Pro: easy to compute and administer; no loss of records
• Con: biases other statistical measurements
  • variance
  • standard deviation
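The bias in variance and standard deviation can be seen in a small experiment. This is a sketch on synthetic data: the distribution, sample size and 30% missing rate are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic complete data
x = rng.normal(loc=50, scale=10, size=1000)

# Remove 300 values completely at random
x_missing = x.copy()
missing_idx = rng.choice(x.size, size=300, replace=False)
x_missing[missing_idx] = np.nan

# Mean imputation: fill every gap with the mean of the observed values
observed_mean = np.nanmean(x_missing)
x_imputed = np.where(np.isnan(x_missing), observed_mean, x_missing)

std_observed = np.nanstd(x_missing, ddof=1)  # spread of the observed values
std_imputed = np.std(x_imputed, ddof=1)      # spread after mean imputation

print("mean (observed vs imputed):", observed_mean, x_imputed.mean())
print("std  (observed vs imputed):", std_observed, std_imputed)
```

The mean is preserved, but the imputed values add no spread, so the standard deviation (and therefore the variance) of the imputed column is smaller than that of the observed values.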
Multivariate Strategies
Multivariate (more than one variable)
1. Logical rules
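One possible reading of a logical rule is a rule that derives a missing value from other variables in the same record. The sketch below assumes a hypothetical table with `birthdate` and `age` columns and a fixed reference date. None of these names or values come from the slides.

```python
import pandas as pd

# Hypothetical records: age is missing in the first row, birthdate in the last
df = pd.DataFrame({
    "birthdate": pd.to_datetime(["1971-01-01", "2000-06-15", None]),
    "age": [None, 21.0, 35.0],
})
reference_date = pd.Timestamp("2022-03-01")  # assumed "today"


def age_in_years(birthdate, today):
    """Whole years between birthdate and today, or NaN if birthdate is missing."""
    if pd.isna(birthdate):
        return float("nan")
    had_birthday = (today.month, today.day) >= (birthdate.month, birthdate.day)
    return float(today.year - birthdate.year - (0 if had_birthday else 1))


# Logical rule: if age is missing but birthdate is known, derive age from it
derived_age = df["birthdate"].apply(lambda b: age_in_years(b, reference_date))
df["age"] = df["age"].fillna(derived_age)

print(df)
```

Unlike mean or median imputation, this uses a second variable from the same record, so the filled value is specific to that record.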
Disguised missing data
• Everyone’s birthday is January 1st?
• Email address is
• Adriaans and Zantinge
  • “Recently, a colleague rented a car in the USA. Since he was Dutch, his post-code did not fit the fields of the computer program. The car hire representative suggested that she use the zip code of the rental office instead.”
• How to handle
  • Look for “unusual” or suspicious values in the dataset, using knowledge about the domain
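A simple way to start looking for suspicious values is to check whether one value is far more common than it should be. The sketch below uses made-up birthdays and an arbitrary 30% threshold. Both are assumptions, and a flagged value still needs to be judged with domain knowledge.

```python
import pandas as pd

# Made-up birthdays: several records share January 1st
birthdays = pd.Series(pd.to_datetime([
    "1990-01-01", "1985-01-01", "1992-07-14", "1979-01-01",
    "2001-03-30", "1988-01-01", "1995-11-02", "1983-01-01",
]))

# Share of records falling on each day of the year (month-day)
month_day = birthdays.dt.strftime("%m-%d")
share = month_day.value_counts(normalize=True)

# Assumed threshold: a single day covering more than 30% of records is suspicious
suspicious = share[share > 0.3]
print(suspicious)
```

Here January 1st covers 5 of 8 records, which suggests it may be a default value standing in for a missing birthday.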
Scale numeric data
• a crucial pre-processing step in the pipeline
• Normalisation
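  – Rescale the values to be between 0 and 1
  – $(x - x_{\min}) / (x_{\max} - x_{\min})$
• Standardisation
  – Rescale the values to have a mean of 0 and a standard deviation of 1
  – $(x - \mu) / \sigma$, where $\mu$ is the mean and $\sigma$ the standard deviation of $x$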
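Both rescalings can be written directly with NumPy. This is a minimal sketch on made-up values. It uses the population standard deviation (NumPy's default), which is an assumption since the slides do not say which version is meant.

```python
import numpy as np

# Made-up numeric feature
x = np.array([62.0, 72.0, 75.0, 80.0, 90.0])

# Normalisation: (x - x_min) / (x_max - x_min), result lies in [0, 1]
normalised = (x - x.min()) / (x.max() - x.min())

# Standardisation: (x - mean) / std, result has mean 0 and standard deviation 1
standardised = (x - x.mean()) / x.std()

print(normalised)
print(standardised.mean(), standardised.std())
```

scikit-learn's `MinMaxScaler` and `StandardScaler` apply the same two formulas column by column.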
Major data preprocessing activities
Data mining concepts and techniques, Han et al 2012
Data cleaning – process
• Many tools exist (Google Refine, Kettle, Talend, …)
• Data scrubbing
• Data discrepancy detection
• Data auditing
• ETL (Extract Transform Load) tools: users specify transformations via a graphical interface
• Our emphasis will be to understand some of the methods employed by typical tools
• Domain knowledge is important