Introduction to Data Wrangling
Introduction to Data Wrangling
Faculty of Information Technology
Monash University
FIT5196 week 1
(Monash) FIT5196 1 / 38
Outline
1 Motivations
2 Introduction to FIT5196 Data Wrangling
Unit structure
Assessments
Unit management
3 Introduction to Data Wrangling
Data Quality Problems
Characteristics of Tidy Data
Major Tasks in Data Wrangling
Programming Environment
4 Demonstration: Wrangling Air Crashes data with Data
Wrangler
5 Summary
(Monash) FIT5196 2 / 38
Motivations
What is Data Wrangling?
From Trifacta’s Data Wrangling practitioners,1
“We define such data wrangling as a process of iterative data exploration and
transformation that enables analysis. One goal is to make data usable — to put
them in a form that can be parsed and manipulated by analysis tools. . . . In
other words, data wrangling is the process of making data useful. Ideally, the
outcome of wrangling is not simply data; it is an editable and auditable
transcript of transformations coupled with a nuanced understanding of data
organization and data quality issues.
Two key points:
É Clean and useful data that can be used in the downstream data analysis.
É Documentation of all data manipulation performed.
1http://vis.stanford.edu/files/2011-DataWrangling-IVJ.pdf
(Monash) FIT5196 3 / 38
Motivations
What is Data Wrangling?
From Trifacta’s Data Wrangling practitioners,1
“We define such data wrangling as a process of iterative data exploration and
transformation that enables analysis. One goal is to make data usable — to put
them in a form that can be parsed and manipulated by analysis tools. . . . In
other words, data wrangling is the process of making data useful. Ideally, the
outcome of wrangling is not simply data; it is an editable and auditable
transcript of transformations coupled with a nuanced understanding of data
organization and data quality issues.
Two key points:
É Clean and useful data that can be used in the downstream data analysis.
É Documentation of all data manipulation performed.
1http://vis.stanford.edu/files/2011-DataWrangling-IVJ.pdf
(Monash) FIT5196 3 / 38
Motivations
What is Data Wrangling?
From Trifacta’s Data Wrangling practitioners,1
“We define such data wrangling as a process of iterative data exploration and
transformation that enables analysis. One goal is to make data usable — to put
them in a form that can be parsed and manipulated by analysis tools. . . . In
other words, data wrangling is the process of making data useful. Ideally, the
outcome of wrangling is not simply data; it is an editable and auditable
transcript of transformations coupled with a nuanced understanding of data
organization and data quality issues.
Two key points:
É Clean and useful data that can be used in the downstream data analysis.
É Documentation of all data manipulation performed.
1http://vis.stanford.edu/files/2011-DataWrangling-IVJ.pdf
(Monash) FIT5196 3 / 38
Motivations
What do analysts wish the data looked like?
The “census income” data set from UCI machine learning data repository.
Data analysis task: Predict whether income exceeds $50K/yr based on age,
education, marital status, native-country, etc.
Algorithms: C4.5 (Decision Tree), Naive-Bayes, Nearest Neighbours, etc.
(Monash) FIT5196 4 / 38
Motivations
What do analysts wish the data looked like?
The “credit approval” data set from UCI machine learning data repository.
Data analysis: Predict whether or not a credit card application should be
approved.
Algorithms: Decision Tree.
(Monash) FIT5196 5 / 38
Motivations
What does data really look like?
Airline Crash dataset from Wikipedia:
(Monash) FIT5196 6 / 38
Motivations
What does data really look like?
Twitter data 2
2https://dev.twitter.com/rest/reference/get/blocks/list
(Monash) FIT5196 7 / 38
Motivations
What does data really look like?
Fungal disease CT report.
(Monash) FIT5196 8 / 38
Motivations
Our goal
Raw data ⇒ Data Wrangling ⇒ Tidy data ⇒ Data Analysis ⇒ Data
Knowledge
Data + Wrangling + Analysis = Data Product (or Knowledge)
(Monash) FIT5196 9 / 38
Introduction to FIT5196 Data Wrangling
Outline
1 Motivations
2 Introduction to FIT5196 Data Wrangling
Unit structure
Assessments
Unit management
3 Introduction to Data Wrangling
4 Demonstration: Wrangling Air Crashes data with Data
Wrangler
5 Summary
(Monash) FIT5196 10 / 38
Introduction to FIT5196 Data Wrangling Unit structure
Unit Objectives
What the course is trying to achieve:
É parse data in the required format;
É assess the quality of data for problem identification;
É resolve data quality issues ready for the data analysis process;
É integrate data sources for data enrichment;
É document the wrangling process for professional reporting;
É write program scripts for data wrangling processes.
What it is not trying to achieve:
É Introduction to Python programming, e.g., how to program in Python.
You MUST be very familiar with Python and the usage of Python packages!
(Monash) FIT5196 11 / 38
Introduction to FIT5196 Data Wrangling Unit structure
Unit outline
Week topic
1 Introduction to Data Wrangling
2 Introduction to Regular Expressions
3 Parsing Raw Data in Different Formats
4 Text Data Preprocessing
5 Text Data Preprocessing
6 Data Cleansing
7 Data Cleansing
8 Data Cleansing
9 Data Integration
10 Data Integration and reshaping
11 Data Enrichment, Transformation, normalization, etc.
12 Summary
(Monash) FIT5196 12 / 38
Introduction to FIT5196 Data Wrangling Assessments
Assessments
Summary
Value Due Date Type
Assessment 1 35% Week 6 Sunday, 2 September 2018 Individual
Assessment 2 35% Week 10, Wednesday 3 October 2018 Individual
Assessment 3 30% Week 12, Sunday 21 October 2018 Individual
General criteria for marking
É The submitted code must work without any errors and must give the
correct results.
É The code should be well structured and properly commented.
É The notebook should be structured in a logical way so that it clearly shows
how students finish the tasks in the assessment.
(Monash) FIT5196 13 / 38
Introduction to FIT5196 Data Wrangling Assessments
Assessments: 1
Assessment 1: Parsing Data + Text Preprocessing
Brief description: Data extracted from different sources is often stored in
different formats. In this assessment, you are required to write Python
(either Python 2 or Python 3) script to
1 extract data from an XML, HTML, and PFD files,
2 convert data stored in an XML and HTML files to a JSON,
3 and generate sparse representation for the pdf files file.
Due date: Week 6 Sunday, 2 September 2018
(Monash) FIT5196 14 / 38
Introduction to FIT5196 Data Wrangling Assessments
Assessment details: 2
Assessment 2: Parsing and Cleansing Raw Data
Brief description: This assessment addresses one the most important steps
in data wrangling, i.e., cleansing data. Students are required to
É inspect, audit and then identify problems existing in the parsed data; and
propose appropriate methods to fix these problems;
É Different generic and major data problems could be found in the data might
include:
− Lexical errors, e.g., typos and spelling mistakes
− Irregularities, e.g., abnormal data values and data formats
− Violations of the Integrity constraint.
− Outliers
− Duplications
− Missing values
− Inconsistency, e.g., inhomogeneity in values and types in representing the same
data
Hurdle: The outcome of the interview, where students will need to
communicate their processes, justify their approaches, and answer some
questions.
Due date: Week 10, Wednesday 3 October 2018
(Monash) FIT5196 15 / 38
Introduction to FIT5196 Data Wrangling Assessments
Assessment details: 3
Assessment task 3: Data Integration and Reshaping
Brief description: This assessment focuses on data integration and
reshaping. The students are required to integrate data that might be
collected from different sources. The students need to resolve different
levels of conflicts in integration according to what we will be discussed in
the lectures. The output of this assessment should be an integrated dataset
and a Jupyter Notebook containing information about your designed global
schema and all the Python scripts used in integrating the data. Moreover,
the students will need to apply various normalization/transformation
methods to the data, and analyze how they affect the distribution of the
data.
Due date: Week 12, Sunday 21 October 2018
(Monash) FIT5196 16 / 38
Introduction to FIT5196 Data Wrangling Unit management
Unit management: CE & Lecturers
CE: Dr. Lan Du
É Contact: Lan.Du@monash.edu
É Office: R142, Building 63 – 25 Exhibition Walk, Clayton Campus
Lecturer: Mr. Mohammad Haqqani
É Contact: Mohammad.Haqqani@monash.edu
É Office: TBA
Lecturers’ consultation by appointment
(Monash) FIT5196 17 / 38
Introduction to FIT5196 Data Wrangling Unit management
Unit management: Tutors
Tutors
(Monash) FIT5196 18 / 38
Introduction to FIT5196 Data Wrangling Unit management
Unit management: Consultations
Consultation time
(Monash) FIT5196 19 / 38
Introduction to FIT5196 Data Wrangling Unit management
Unit management: Lectures & Tutorial Classes
(Monash) FIT5196 20 / 38
Introduction to Data Wrangling
Outline
1 Motivations
2 Introduction to FIT5196 Data Wrangling
3 Introduction to Data Wrangling
Data Quality Problems
Characteristics of Tidy Data
Major Tasks in Data Wrangling
Programming Environment
4 Demonstration: Wrangling Air Crashes data with Data
Wrangler
5 Summary
(Monash) FIT5196 21 / 38
Introduction to Data Wrangling
Data Wrangling: the No.1 challenge in data analysis
Challenges:
É A massive amount of data has become available and can be collected from
various sources, like mobile devices, web pages, email, ATMs, social media,
corporate databases, etc.
− Freebase dump: 250G
É Data from different sources often comes in different formats (e.g., JSON,
XML, CSV, Excel and PDF)
É Data possesses a variety of data quality issues (e.g., inconsistent values,
duplicates, missing values, and outliers)
É It is impossible to directly run any existing data analysis algorithms over raw
data.
É The whole process of data wrangling can account up to 80% of the time in
the analysis cycle.
Opportunities:
É One prediction for big data analytics: Automating and simplifying complex
data wrangling process with machine learning technologies.
− Enable enterprises, like banks and finance institutions, gain better insights and
derive greater business values from their data.
É Demands from various domains: business, finance, health informatics,
government agents. etc.
(Monash) FIT5196 22 / 38
Introduction to Data Wrangling Data Quality Problems
Data glitches — data quality problems
The real-world data is almost always incomplete, dirty and inconsistent, which
attributes to the necessity of data wrangling.
Where the data problems come from? For example,
É Manual entry errors
É Malfunction of measurement devices
É Data sources follow different conventions, formats, or data models.
Data Quality problems:
É Interpretability issue
É Data format issues
É Inconsistent and faulty data
É Missing values
É Outliers
É Duplicates
(Monash) FIT5196 23 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: Interpretability issue
Is your data set interpretable?
Figure: The Switzerland heart disease data set from UCI machine learning repository
Attributes in columns in order: age, sex, chest pain type, resting blood
pressure, serum cholesterol, fasting blood sugar, etc. 3
A data dictionary is needed.
3See http://archive.ics.uci.edu/ml/datasets/Heart+Disease for more details.
(Monash) FIT5196 24 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: Data format issues
In which format is your data stored?
JavaScript Object Notation (JSON):
Extensible Markup Language (XML)
other formats:
É CSV, Excel and PDF
(Monash) FIT5196 25 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: Inconsistent and faulty data
Does you data contain mis-typed, non-standard, inconsistent entries, etc.?
Mr. Mark John 33 21-08-1985 180 M 0433010010 Mel,VIC
Mr. Chris, Peter 34 21-Sep-1982 ? Fale 0000000000 Syd, NSW
Ethan Steedman 36 01/01/82 17o M 0388886789 Mel,VIC
Inconsistant date format
Age not matching date of birth
Different name formats
(Monash) FIT5196 26 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: Missing values
In your data set, are any data values that should be presented but absent for
some reasons?
Figure: Missing values in the Switzerland
heart disease data set are indicated by “?”.
Big issue of finding missing values:
missing values are defaulted to a
valid value of the variable itself.
É Represent missing values by zero
Missing data could result in serious
bias in the analyses.
(Monash) FIT5196 27 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: outliers
Is there any observation that lies an abnormal distance from the majority of the
other observations in the dataset?4
Figure: Bank transaction data, where x:
hours since last transaction, y: transaction
amount
Outliers can be either bad or
interesting:
É Finance institutions might be
interested in identifying
transactions that do not behave
in a normal way.
− High value transactions are
occurring on inactive accounts
4Figure is from http://blog.easysol.net/advanced-outlier-detection/
(Monash) FIT5196 28 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: outliers
Is there any observation that lies an abnormal distance from the majority of the
other observations in the dataset?4
Figure: Outliers identifies by the Local
Outlier Factor (LOF) method.
Outliers can be either bad or
interesting:
É Finance institutions might be
interested in identifying
transactions that do not behave
in a normal way.
− High value transactions are
occurring on inactive accounts
4Figure is from http://blog.easysol.net/advanced-outlier-detection/
(Monash) FIT5196 28 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: Duplicates
In your data, are there multiple entries that actually corresponds to the same
piece of information?
For example, in a database where DOB and mobile phone number can
uniquely identify an individual and the attribute of interest is height, the two
entries above are duplicates.
(Monash) FIT5196 29 / 38
Introduction to Data Wrangling Data Quality Problems
Data quality problems: Consequences
“Data quality issues can seriously skew the results of data mining and analysis,
with consequences that can potentially cost billions; corporations could make
erroneous decisions on misleading results, and machinery could be incorrectly
calibrated leading to disastrous failures.” — by Dasu. 5
Example
A credit card company is interested in predicting whether an individual will
default on his or her credit payment.
The company will pay for a large price for misclassifying defaulters as
non-defaulters due to the data quality problems.
5“Data Glitches: Monsters in Your Data” in “Handbook of Data Quality” 2013
(Monash) FIT5196 30 / 38
Introduction to Data Wrangling Characteristics of Tidy Data
Tidy data
Data Structure: most statistical datasets are rectangular tables made up of
rows and columns.
Data semantics
É A dataset is a collection of Values.
É A variable contains all values that measure the same underlying attribute.
É An observation contains all values measured on the same unit (like a patient)
(Monash) FIT5196 31 / 38
Introduction to Data Wrangling Characteristics of Tidy Data
Tidy data
A dataset is messy or tidy depending on how rows, columns and types are
matched up with observations, variables and tables.6
É Each variable forms a column
É Each observation forms a row
É Each type of observational unit forms a table
É If you have multiple tables, they should include a column in the table that
allows them to be linked.
6See “Tidy Data” By Hadley Wickham, published in Journal of Statistical Software, 2014
(Monash) FIT5196 31 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Data acquisition: Gather data from
different resources, e.g., the web,
sensors, and conventional
databases via API requests (e.g.,
Twitter’s API and Google API),
web scraping (acquiring data from
the Internet through many ways
other than API access), etc. Tools
used include various python
package, pandas, R, etc.
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Data loading & extracting: Load
and parse data stored in many
different formats, like XML, JSON,
CSV, natural language text, etc.
Tools used include, for instance,
BeautifulSoup (one of many python
packages for parsing XML/HTML),
regular expressions, NLTK (a
python package for natural
language processing).
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Data cleaning: Diagnose and
handle various data quality
problems. Performing data cleaning
we need a set of operations that
impute missing values, resolve
inconsistencies, identify/remove
outliers, unify data formats and
other problems discussed in
previous slides.
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Data integration: Merge data from
different resources to create a rich
and complete data set. It involves
a set of operations that resolve
related issues, such as data
duplication, entity matching, and
schema matching.
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Data profiling: Utilises different
kinds of descriptive statistics and
visualisation tools to improve data
quality. The data profiling process
might uncover more data quality
problems, and suggests more
operations for data cleaning and
data integration.
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Data enrichment: Enrich existing
data by feature generation, data
transformation, data aggregation
and data reduction, etc.
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Data storing: Finally store the
clean data in various formats,
which are easily accessible by
downstream analysis tools.
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Major Tasks in Data Wrangling
Major Task in Data Wrangling
Data acquisition
Data loading & extracting
Data
cleaning
Data
integration
Data
enrichment
Data profiling
Data storing
Documenting the process: We
should also keep a detailed
description of all data
manipulations applied in the above
tasks and generate a proper code
book that describes each variable
and its values in the clean data.
É Why documentation:
collaboration
É Collaboration tools: Jupyter
notebook, Github, etc.
(Monash) FIT5196 32 / 38
Introduction to Data Wrangling Programming Environment
Programming language & environment: Python + Jupyter Notebook
Programming language: Python 2.7 or 3.6
É A scripting language that is easy to get started with and it also comes with a
large number of libraries that can be used in data wrangling tasks
É Major libraries used in this units include (but not limited to)
− Pandas: a library that provides high-level data structures and manipulation
tools that are designed to make data processing fast and easy in Python
− NLTK: a platform for building Python programs to work with human language
data
− BeautifulSoup: a simple and efficient library for navigating, searching, and
modifying HTML and XML documents.
− Scipy: a fundamental library for scientific computing.
− scikit-learn: an efficient Python library for data mining and data analysis.
Programming environment: Jupyter notebook
É The Jupyter Notebook is a web application that allows you to create and
share documents that contain live code, equations, visualisations and
explanatory text.
(Monash) FIT5196 33 / 38
Introduction to Data Wrangling Programming Environment
Programming language & environment: Python + Jupyter Notebook
Dual Python environments
É Conda Managing environments:
https://conda.io/docs/user-guide/tasks/manage-environments.html
1 conda create -n python3 python=3.6
2 conda install nb_conda
Most of the notebooks will run in both versions.
(Monash) FIT5196 34 / 38
Demonstration: Wrangling Air Crashes data with Data Wrangler
Wrangling Air Crashes data with Data Wrangler
Data set: Air Crashes data downloaded from Wikipedia
Application: Data Wrangler from http://vis.stanford.edu/wrangler/
Goal: wrangle the data so that each row corresponds to one crash disaster,
each column is one variable.
(Monash) FIT5196 35 / 38
Demonstration: Wrangling Air Crashes data with Data Wrangler
Wrangling Air Crashes data with Data Wrangler
(Monash) FIT5196 36 / 38
Demonstration: Wrangling Air Crashes data with Data Wrangler
Wrangling Air Crashes data with Data Wrangler
Demonstration with Data Wrangler!
(Monash) FIT5196 37 / 38
Summary
Summary: what to do this week
Please please download and read materials provided in Moodle.
Set up your programming environment by installing Anaconda Python 2 or
Python 3 distribution. (Suggestion!)
Attend tutorial 1 in week 2.
Last but not least
É Choose FIT5196 wisely.
É Use the discussion forum in a proper way and with respect!
(Monash) FIT5196 38 / 38
Motivations
Introduction to FIT5196 Data Wrangling
Unit structure
Assessments
Unit management
Introduction to Data Wrangling
Data Quality Problems
Characteristics of Tidy Data
Major Tasks in Data Wrangling
Programming Environment
Demonstration: Wrangling Air Crashes data with Data Wrangler
Summary