程序代写代做代考 data mining python database algorithm finance data structure Excel Java decision tree Hive javascript Introduction to Data Wrangling

Introduction to Data Wrangling

Introduction to Data Wrangling

Faculty of Information Technology
Monash University

FIT5196 week 1

(Monash) FIT5196 1 / 38

Outline

1 Motivations

2 Introduction to FIT5196 Data Wrangling
Unit structure
Assessments
Unit management

3 Introduction to Data Wrangling
Data Quality Problems
Characteristics of Tidy Data
Major Tasks in Data Wrangling
Programming Environment

4 Demonstration: Wrangling Air Crashes data with Data
Wrangler

5 Summary

(Monash) FIT5196 2 / 38

Motivations

What is Data Wrangling?

From Trifacta’s Data Wrangling practitioners,1

“We define such data wrangling as a process of iterative data exploration and
transformation that enables analysis. One goal is to make data usable — to put
them in a form that can be parsed and manipulated by analysis tools. . . . In
other words, data wrangling is the process of making data useful. Ideally, the
outcome of wrangling is not simply data; it is an editable and auditable
transcript of transformations coupled with a nuanced understanding of data
organization and data quality issues.

Two key points:
É Clean and useful data that can be used in the downstream data analysis.
É Documentation of all data manipulation performed.

1http://vis.stanford.edu/files/2011-DataWrangling-IVJ.pdf
(Monash) FIT5196 3 / 38

Motivations

What is Data Wrangling?

From Trifacta’s Data Wrangling practitioners,1

“We define such data wrangling as a process of iterative data exploration and
transformation that enables analysis. One goal is to make data usable — to put
them in a form that can be parsed and manipulated by analysis tools. . . . In
other words, data wrangling is the process of making data useful. Ideally, the
outcome of wrangling is not simply data; it is an editable and auditable
transcript of transformations coupled with a nuanced understanding of data
organization and data quality issues.

Two key points:
É Clean and useful data that can be used in the downstream data analysis.
É Documentation of all data manipulation performed.

1http://vis.stanford.edu/files/2011-DataWrangling-IVJ.pdf
(Monash) FIT5196 3 / 38

Motivations

What is Data Wrangling?

From Trifacta’s Data Wrangling practitioners,1

“We define such data wrangling as a process of iterative data exploration and
transformation that enables analysis. One goal is to make data usable — to put
them in a form that can be parsed and manipulated by analysis tools. . . . In
other words, data wrangling is the process of making data useful. Ideally, the
outcome of wrangling is not simply data; it is an editable and auditable
transcript of transformations coupled with a nuanced understanding of data
organization and data quality issues.

Two key points:
É Clean and useful data that can be used in the downstream data analysis.
É Documentation of all data manipulation performed.

1http://vis.stanford.edu/files/2011-DataWrangling-IVJ.pdf
(Monash) FIT5196 3 / 38

Motivations

What do analysts wish the data looked like?

The “census income” data set from UCI machine learning data repository.
Data analysis task: Predict whether income exceeds $50K/yr based on age,
education, marital status, native-country, etc.
Algorithms: C4.5 (Decision Tree), Naive-Bayes, Nearest Neighbours, etc.

(Monash) FIT5196 4 / 38

Motivations

What do analysts wish the data looked like?

The “credit approval” data set from UCI machine learning data repository.
Data analysis: Predict whether or not a credit card application should be
approved.
Algorithms: Decision Tree.

(Monash) FIT5196 5 / 38

Motivations

What does data really look like?

Airline Crash dataset from Wikipedia:

(Monash) FIT5196 6 / 38

Motivations

What does data really look like?

Twitter data 2

2https://dev.twitter.com/rest/reference/get/blocks/list
(Monash) FIT5196 7 / 38

Motivations

What does data really look like?

Fungal disease CT report.

(Monash) FIT5196 8 / 38

Motivations

Our goal

Raw data ⇒ Data Wrangling ⇒ Tidy data ⇒ Data Analysis ⇒ Data
Knowledge

Data + Wrangling + Analysis = Data Product (or Knowledge)

(Monash) FIT5196 9 / 38

Introduction to FIT5196 Data Wrangling

Outline

1 Motivations

2 Introduction to FIT5196 Data Wrangling
Unit structure
Assessments
Unit management

3 Introduction to Data Wrangling

4 Demonstration: Wrangling Air Crashes data with Data
Wrangler

5 Summary

(Monash) FIT5196 10 / 38

Introduction to FIT5196 Data Wrangling Unit structure

Unit Objectives

What the course is trying to achieve:
É parse data in the required format;
É assess the quality of data for problem identification;
É resolve data quality issues ready for the data analysis process;
É integrate data sources for data enrichment;
É document the wrangling process for professional reporting;
É write program scripts for data wrangling processes.

What it is not trying to achieve:
É Introduction to Python programming, e.g., how to program in Python.

You MUST be very familiar with Python and the usage of Python packages!

(Monash) FIT5196 11 / 38

Introduction to FIT5196 Data Wrangling Unit structure

Unit outline

Week topic
1 Introduction to Data Wrangling
2 Introduction to Regular Expressions
3 Parsing Raw Data in Different Formats
4 Text Data Preprocessing
5 Text Data Preprocessing
6 Data Cleansing
7 Data Cleansing
8 Data Cleansing
9 Data Integration
10 Data Integration and reshaping
11 Data Enrichment, Transformation, normalization, etc.
12 Summary

(Monash) FIT5196 12 / 38

Introduction to FIT5196 Data Wrangling Assessments

Assessments

Summary

Value Due Date Type
Assessment 1 35% Week 6 Sunday, 2 September 2018 Individual
Assessment 2 35% Week 10, Wednesday 3 October 2018 Individual
Assessment 3 30% Week 12, Sunday 21 October 2018 Individual

General criteria for marking
É The submitted code must work without any errors and must give the
correct results.

É The code should be well structured and properly commented.
É The notebook should be structured in a logical way so that it clearly shows
how students finish the tasks in the assessment.

(Monash) FIT5196 13 / 38

Introduction to FIT5196 Data Wrangling Assessments

Assessments: 1

Assessment 1: Parsing Data + Text Preprocessing
Brief description: Data extracted from different sources is often stored in
different formats. In this assessment, you are required to write Python
(either Python 2 or Python 3) script to

1 extract data from an XML, HTML, and PFD files,
2 convert data stored in an XML and HTML files to a JSON,
3 and generate sparse representation for the pdf files file.

Due date: Week 6 Sunday, 2 September 2018

(Monash) FIT5196 14 / 38

Introduction to FIT5196 Data Wrangling Assessments

Assessment details: 2

Assessment 2: Parsing and Cleansing Raw Data
Brief description: This assessment addresses one the most important steps
in data wrangling, i.e., cleansing data. Students are required to
É inspect, audit and then identify problems existing in the parsed data; and
propose appropriate methods to fix these problems;

É Different generic and major data problems could be found in the data might
include:

− Lexical errors, e.g., typos and spelling mistakes
− Irregularities, e.g., abnormal data values and data formats
− Violations of the Integrity constraint.
− Outliers
− Duplications
− Missing values
− Inconsistency, e.g., inhomogeneity in values and types in representing the same

data

Hurdle: The outcome of the interview, where students will need to
communicate their processes, justify their approaches, and answer some
questions.
Due date: Week 10, Wednesday 3 October 2018

(Monash) FIT5196 15 / 38

Introduction to FIT5196 Data Wrangling Assessments

Assessment details: 3

Assessment task 3: Data Integration and Reshaping

Brief description: This assessment focuses on data integration and
reshaping. The students are required to integrate data that might be
collected from different sources. The students need to resolve different
levels of conflicts in integration according to what we will be discussed in
the lectures. The output of this assessment should be an integrated dataset
and a Jupyter Notebook containing information about your designed global
schema and all the Python scripts used in integrating the data. Moreover,
the students will need to apply various normalization/transformation
methods to the data, and analyze how they affect the distribution of the
data.

Due date: Week 12, Sunday 21 October 2018

(Monash) FIT5196 16 / 38

Introduction to FIT5196 Data Wrangling Unit management

Unit management: CE & Lecturers

CE: Dr. Lan Du
É Contact: Lan.Du@monash.edu
É Office: R142, Building 63 – 25 Exhibition Walk, Clayton Campus

Lecturer: Mr. Mohammad Haqqani
É Contact: Mohammad.Haqqani@monash.edu
É Office: TBA

Lecturers’ consultation by appointment

(Monash) FIT5196 17 / 38

Introduction to FIT5196 Data Wrangling Unit management

Unit management: Tutors

Tutors

(Monash) FIT5196 18 / 38

Introduction to FIT5196 Data Wrangling Unit management

Unit management: Consultations

Consultation time

(Monash) FIT5196 19 / 38

Introduction to FIT5196 Data Wrangling Unit management

Unit management: Lectures & Tutorial Classes

(Monash) FIT5196 20 / 38

Introduction to Data Wrangling

Outline

1 Motivations

2 Introduction to FIT5196 Data Wrangling

3 Introduction to Data Wrangling
Data Quality Problems
Characteristics of Tidy Data
Major Tasks in Data Wrangling
Programming Environment

4 Demonstration: Wrangling Air Crashes data with Data
Wrangler

5 Summary

(Monash) FIT5196 21 / 38

Introduction to Data Wrangling

Data Wrangling: the No.1 challenge in data analysis

Challenges:
É A massive amount of data has become available and can be collected from
various sources, like mobile devices, web pages, email, ATMs, social media,
corporate databases, etc.

− Freebase dump: 250G
É Data from different sources often comes in different formats (e.g., JSON,
XML, CSV, Excel and PDF)

É Data possesses a variety of data quality issues (e.g., inconsistent values,
duplicates, missing values, and outliers)

É It is impossible to directly run any existing data analysis algorithms over raw
data.

É The whole process of data wrangling can account up to 80% of the time in
the analysis cycle.

Opportunities:
É One prediction for big data analytics: Automating and simplifying complex
data wrangling process with machine learning technologies.

− Enable enterprises, like banks and finance institutions, gain better insights and
derive greater business values from their data.

É Demands from various domains: business, finance, health informatics,
government agents. etc.
(Monash) FIT5196 22 / 38

Introduction to Data Wrangling Data Quality Problems

Data glitches — data quality problems

The real-world data is almost always incomplete, dirty and inconsistent, which
attributes to the necessity of data wrangling.

Where the data problems come from? For example,
É Manual entry errors
É Malfunction of measurement devices
É Data sources follow different conventions, formats, or data models.

Data Quality problems:
É Interpretability issue
É Data format issues
É Inconsistent and faulty data
É Missing values
É Outliers
É Duplicates

(Monash) FIT5196 23 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: Interpretability issue

Is your data set interpretable?

Figure: The Switzerland heart disease data set from UCI machine learning repository

Attributes in columns in order: age, sex, chest pain type, resting blood
pressure, serum cholesterol, fasting blood sugar, etc. 3

A data dictionary is needed.
3See http://archive.ics.uci.edu/ml/datasets/Heart+Disease for more details.

(Monash) FIT5196 24 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: Data format issues

In which format is your data stored?

JavaScript Object Notation (JSON):
Extensible Markup Language (XML)

other formats:
É CSV, Excel and PDF

(Monash) FIT5196 25 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: Inconsistent and faulty data

Does you data contain mis-typed, non-standard, inconsistent entries, etc.?
Mr. Mark John 33 21-08-1985 180 M 0433010010 Mel,VIC
Mr. Chris, Peter 34 21-Sep-1982 ? Fale 0000000000 Syd, NSW
Ethan Steedman 36 01/01/82 17o M 0388886789 Mel,VIC

Inconsistant date format

Age not matching date of birth

Different name formats

(Monash) FIT5196 26 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: Missing values

In your data set, are any data values that should be presented but absent for
some reasons?

Figure: Missing values in the Switzerland
heart disease data set are indicated by “?”.

Big issue of finding missing values:
missing values are defaulted to a
valid value of the variable itself.
É Represent missing values by zero

Missing data could result in serious
bias in the analyses.

(Monash) FIT5196 27 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: outliers

Is there any observation that lies an abnormal distance from the majority of the
other observations in the dataset?4

Figure: Bank transaction data, where x:
hours since last transaction, y: transaction
amount

Outliers can be either bad or
interesting:
É Finance institutions might be
interested in identifying
transactions that do not behave
in a normal way.

− High value transactions are
occurring on inactive accounts

4Figure is from http://blog.easysol.net/advanced-outlier-detection/
(Monash) FIT5196 28 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: outliers

Is there any observation that lies an abnormal distance from the majority of the
other observations in the dataset?4

Figure: Outliers identifies by the Local
Outlier Factor (LOF) method.

Outliers can be either bad or
interesting:
É Finance institutions might be
interested in identifying
transactions that do not behave
in a normal way.

− High value transactions are
occurring on inactive accounts

4Figure is from http://blog.easysol.net/advanced-outlier-detection/
(Monash) FIT5196 28 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: Duplicates

In your data, are there multiple entries that actually corresponds to the same
piece of information?

For example, in a database where DOB and mobile phone number can
uniquely identify an individual and the attribute of interest is height, the two
entries above are duplicates.

(Monash) FIT5196 29 / 38

Introduction to Data Wrangling Data Quality Problems

Data quality problems: Consequences

“Data quality issues can seriously skew the results of data mining and analysis,
with consequences that can potentially cost billions; corporations could make
erroneous decisions on misleading results, and machinery could be incorrectly
calibrated leading to disastrous failures.” — by Dasu. 5

Example

A credit card company is interested in predicting whether an individual will
default on his or her credit payment.

The company will pay for a large price for misclassifying defaulters as
non-defaulters due to the data quality problems.

5“Data Glitches: Monsters in Your Data” in “Handbook of Data Quality” 2013
(Monash) FIT5196 30 / 38

Introduction to Data Wrangling Characteristics of Tidy Data

Tidy data

Data Structure: most statistical datasets are rectangular tables made up of
rows and columns.
Data semantics
É A dataset is a collection of Values.
É A variable contains all values that measure the same underlying attribute.
É An observation contains all values measured on the same unit (like a patient)

(Monash) FIT5196 31 / 38

Introduction to Data Wrangling Characteristics of Tidy Data

Tidy data

A dataset is messy or tidy depending on how rows, columns and types are
matched up with observations, variables and tables.6
É Each variable forms a column
É Each observation forms a row
É Each type of observational unit forms a table
É If you have multiple tables, they should include a column in the table that
allows them to be linked.

6See “Tidy Data” By Hadley Wickham, published in Journal of Statistical Software, 2014
(Monash) FIT5196 31 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Data acquisition: Gather data from
different resources, e.g., the web,
sensors, and conventional
databases via API requests (e.g.,
Twitter’s API and Google API),
web scraping (acquiring data from
the Internet through many ways
other than API access), etc. Tools
used include various python
package, pandas, R, etc.

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Data loading & extracting: Load
and parse data stored in many
different formats, like XML, JSON,
CSV, natural language text, etc.
Tools used include, for instance,
BeautifulSoup (one of many python
packages for parsing XML/HTML),
regular expressions, NLTK (a
python package for natural
language processing).

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Data cleaning: Diagnose and
handle various data quality
problems. Performing data cleaning
we need a set of operations that
impute missing values, resolve
inconsistencies, identify/remove
outliers, unify data formats and
other problems discussed in
previous slides.

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Data integration: Merge data from
different resources to create a rich
and complete data set. It involves
a set of operations that resolve
related issues, such as data
duplication, entity matching, and
schema matching.

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Data profiling: Utilises different
kinds of descriptive statistics and
visualisation tools to improve data
quality. The data profiling process
might uncover more data quality
problems, and suggests more
operations for data cleaning and
data integration.

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Data enrichment: Enrich existing
data by feature generation, data
transformation, data aggregation
and data reduction, etc.

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Data storing: Finally store the
clean data in various formats,
which are easily accessible by
downstream analysis tools.

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Major Tasks in Data Wrangling

Major Task in Data Wrangling

Data acquisition

Data loading & extracting

Data
cleaning

Data
integration

Data
enrichment

Data profiling

Data storing

Documenting the process: We
should also keep a detailed
description of all data
manipulations applied in the above
tasks and generate a proper code
book that describes each variable
and its values in the clean data.
É Why documentation:
collaboration

É Collaboration tools: Jupyter
notebook, Github, etc.

(Monash) FIT5196 32 / 38

Introduction to Data Wrangling Programming Environment

Programming language & environment: Python + Jupyter Notebook

Programming language: Python 2.7 or 3.6
É A scripting language that is easy to get started with and it also comes with a
large number of libraries that can be used in data wrangling tasks

É Major libraries used in this units include (but not limited to)
− Pandas: a library that provides high-level data structures and manipulation

tools that are designed to make data processing fast and easy in Python
− NLTK: a platform for building Python programs to work with human language

data
− BeautifulSoup: a simple and efficient library for navigating, searching, and

modifying HTML and XML documents.
− Scipy: a fundamental library for scientific computing.
− scikit-learn: an efficient Python library for data mining and data analysis.

Programming environment: Jupyter notebook
É The Jupyter Notebook is a web application that allows you to create and
share documents that contain live code, equations, visualisations and
explanatory text.

(Monash) FIT5196 33 / 38

Introduction to Data Wrangling Programming Environment

Programming language & environment: Python + Jupyter Notebook

Dual Python environments
É Conda Managing environments:
https://conda.io/docs/user-guide/tasks/manage-environments.html

1 conda create -n python3 python=3.6
2 conda install nb_conda

Most of the notebooks will run in both versions.

(Monash) FIT5196 34 / 38

Demonstration: Wrangling Air Crashes data with Data Wrangler

Wrangling Air Crashes data with Data Wrangler

Data set: Air Crashes data downloaded from Wikipedia
Application: Data Wrangler from http://vis.stanford.edu/wrangler/
Goal: wrangle the data so that each row corresponds to one crash disaster,
each column is one variable.

(Monash) FIT5196 35 / 38

Demonstration: Wrangling Air Crashes data with Data Wrangler

Wrangling Air Crashes data with Data Wrangler

(Monash) FIT5196 36 / 38

Demonstration: Wrangling Air Crashes data with Data Wrangler

Wrangling Air Crashes data with Data Wrangler

Demonstration with Data Wrangler!

(Monash) FIT5196 37 / 38

Summary

Summary: what to do this week

Please please download and read materials provided in Moodle.

Set up your programming environment by installing Anaconda Python 2 or
Python 3 distribution. (Suggestion!)

Attend tutorial 1 in week 2.
Last but not least
É Choose FIT5196 wisely.
É Use the discussion forum in a proper way and with respect!

(Monash) FIT5196 38 / 38

Motivations
Introduction to FIT5196 Data Wrangling
Unit structure
Assessments
Unit management

Introduction to Data Wrangling
Data Quality Problems
Characteristics of Tidy Data
Major Tasks in Data Wrangling
Programming Environment

Demonstration: Wrangling Air Crashes data with Data Wrangler
Summary