COMP3430 / COMP8430 Data wrangling
Lecture 2: The data wrangling process and understanding data
(Lecturer: )
Lecture outline
● The data wrangling process / pipeline / tasks
● Understanding data: sources, types, and formats
● Example data wrangling tools and resources
The data mining / analytics process
Typically up to 90% of time and effort are spent in the first three steps! (based on: CRISP-DM, the Cross-Industry Standard Process for Data Mining)
Data wrangling
The data wrangling process (1)
The data wrangling process (2)
Main data wrangling tasks
● Data extraction: From different sources, both internal and, increasingly, external to an organisation
● Data quality assessment: Along a variety of dimensions
● Data profiling: Exploration, summarisation, and visualisation to better understand data (data are not modified)
● Data cleaning: Transformation, reshaping, aggregation, reduction, imputation, parsing, standardisation (any task that changes data)
● Data integration: Schema matching and mapping, data matching, record linkage, deduplication, data fusion
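As a rough illustration of how profiling, cleaning, and deduplication differ, the sketch below uses only the Python standard library on a made-up list of records (the field names, values, and function names are invented for this example). Profiling only reads the data, cleaning returns changed copies, and the deduplication step is a deliberately naive exact match on the cleaned values, far simpler than real record linkage.

```python
from collections import Counter

# Hypothetical toy records; names and values are invented for illustration.
records = [
    {"name": "  Ann Lee ", "postcode": "2601"},
    {"name": "ann lee", "postcode": "2601"},
    {"name": "Bob Ray", "postcode": ""},
]


def profile(rows):
    """Profiling: summarise the data without modifying it."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if not value.strip():
                missing[field] += 1
    return {"n_records": len(rows), "missing": dict(missing)}


def clean(rows):
    """Cleaning: return standardised copies (whitespace and letter case)."""
    return [
        {
            "name": " ".join(row["name"].split()).title(),
            "postcode": row["postcode"].strip(),
        }
        for row in rows
    ]


def deduplicate(rows):
    """Naive deduplication: keep the first of each group of exact duplicates."""
    seen, unique_rows = set(), []
    for row in rows:
        key = (row["name"], row["postcode"])
        if key not in seen:
            seen.add(key)
            unique_rows.append(row)
    return unique_rows


summary = profile(records)      # reads only
cleaned = clean(records)        # changed copies, input untouched
unique = deduplicate(cleaned)   # two records remain
```

Note that the two "Ann Lee" records only become duplicates after cleaning, which is one reason cleaning usually precedes integration.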
Understanding data
What is data?
● Data is how we store observations in reusable form
● Observations are about entities and their attributes, as well as relationships between entities
● Sometimes (ideally) entities have unique identifiers (products have barcodes, most Australians have a Tax File Number (TFN) or a Medicare number, books have ISBNs, etc.)
● Unique entity identifiers should be stable over time, accurate, complete, and robust (for example, protected by a checksum within the identifier number)
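To make the checksum idea concrete, here is a minimal sketch that validates the check digit of an ISBN-13, one of the identifiers mentioned above. The function name is our own. The rule is that the 13 digits are weighted alternately by 1 and 3, and the weighted sum must be divisible by 10.

```python
def is_valid_isbn13(isbn: str) -> bool:
    """Return True if the ISBN-13 check digit is consistent with the other digits."""
    # Ignore hyphens and spaces; keep digits only.
    digits = [int(ch) for ch in isbn if ch.isdigit()]
    if len(digits) != 13:
        return False
    # Weights alternate 1, 3, 1, 3, ... across all 13 digits.
    total = sum(d * (3 if i % 2 else 1) for i, d in enumerate(digits))
    return total % 10 == 0


ok = is_valid_isbn13("978-0-306-40615-7")      # weighted sum is 100
typo = is_valid_isbn13("978-0-306-40615-8")    # last digit mistyped, sum is 101
```

A check like this catches any single mistyped digit, which is what makes such an identifier robust against data entry errors.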
Sources of data (1)
● Relational databases
  ● Transactional data, mostly normalised into many tables with keys between them, continuous and frequent updates on (single) records
● Data warehouses
  ● Decision support data, processed and cleaned, historical data, aggregated, updated at certain intervals
● Internet
  ● Click-stream data, log files, Web pages (HTML, XML), blogs, e-mails, posts, multimedia data (images, videos, audio, etc.)
Sources of data (2)
● Files
  ● Portable text (such as comma-separated, tab-separated, or fixed-width columns) or non-portable proprietary binary files
● Scientific instruments, experiments and simulations
  ● Astronomy, genomics, seismology, physics, chemistry, etc.
● Sensors (often data streams)
● Internet of Things (IoT)
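The three portable text layouts mentioned under Files can be read with the Python standard library. The sketch below parses the same made-up record from comma-separated, tab-separated, and fixed-width text. The column layout, field names, and helper names are assumptions for this example only.

```python
import csv
import io

# The same hypothetical record in three portable text layouts.
csv_text = "id,name,postcode\n1,Ann Lee,2601\n"
tsv_text = "id\tname\tpostcode\n1\tAnn Lee\t2601\n"
fixed_text = f"{'1':<4}{'Ann Lee':<10}{'2601':<4}\n"

# Assumed fixed-width layout: (field name, start offset, end offset).
FIXED_LAYOUT = [("id", 0, 4), ("name", 4, 14), ("postcode", 14, 18)]


def read_delimited(text, delimiter):
    """Parse delimited text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


def read_fixed(text, layout):
    """Parse fixed-width text by slicing each line at the given offsets."""
    return [
        {field: line[start:end].strip() for field, start, end in layout}
        for line in text.splitlines()
        if line
    ]


from_csv = read_delimited(csv_text, ",")
from_tsv = read_delimited(tsv_text, "\t")
from_fixed = read_fixed(fixed_text, FIXED_LAYOUT)
```

Delimited files carry their field names in a header row, whereas a fixed-width file needs its layout supplied separately, which is why such files usually come with a data dictionary.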
Data size and complexity
● “We are drowning in data but starving of knowledge” ( , author of the Data Mining text book)
● Automated data collection and mature database technology
  ● Allows data to be stored efficiently, cheaply, and persistently
  ● Using databases, data warehouses and other repositories
  ● Data are increasingly stored in a distributed fashion (storage area networks, grids, etc.)
● Large and massive data collections
  ● Millions to billions of records
  ● Tens to thousands of attributes (sometimes also called variables)
● Data are rarely collected for data analytics (rather for online transaction processing, OLTP)
● A lot of data are write only (or read once only)
Types and measurements of data (1)
● Numerical data
  ● Integer, floating-point, binary, interval, ratio
  ● Non-scalar (like velocity: speed and direction; location: latitude and longitude)
● Non-numerical data
  ● Nominal data (just naming things, for example personal names)
  ● Categorical data (grouping things, like postcodes, university course codes)
  ● Ordinal data (ordering things, for example wine tasting, movie ratings)
● Series data
  ● Ordering is an important feature (otherwise not series data)
  ● One attribute must always be monotonic (increasing or decreasing)
  ● Most common are time series
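The monotonicity requirement for series data can be checked directly. The following is a small sketch (the function name and the sample dates are invented here) that tests whether the ordering attribute of a series, such as its timestamps, only ever increases or only ever decreases.

```python
from datetime import date


def is_monotonic(values, strict=True):
    """Return True if the list is entirely increasing or entirely decreasing.

    With strict=True, repeated values are not allowed.
    """
    pairs = list(zip(values, values[1:]))
    if strict:
        return all(a < b for a, b in pairs) or all(a > b for a, b in pairs)
    return all(a <= b for a, b in pairs) or all(a >= b for a, b in pairs)


# Hypothetical time stamps of two small series.
ordered = [date(2020, 1, 1), date(2020, 1, 2), date(2020, 1, 5)]
shuffled = [date(2020, 1, 2), date(2020, 1, 1), date(2020, 1, 5)]

ordered_ok = is_monotonic(ordered)
shuffled_ok = is_monotonic(shuffled)
```

A failed check often signals that records need to be sorted, or that timestamps were recorded incorrectly, before any time series technique is applied.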
Types and measurements of data (2)
● Multimedia data
  ● Images, video, audio
  ● Many standard formats used, binary, often compressed (like JPEG)
● Different mappings and conversions between data types are possible and often needed
  ● Some conversions are lossless, others are lossy
● Different data wrangling (and data analytics) techniques can handle different types of data
  ● Some are restricted to certain types of data, for example only numerical data
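The difference between lossless and lossy conversions can be seen with a single number. In this small sketch (the variable names and the value are chosen for illustration), converting a Python float to its text representation and back recovers the original value, whereas converting it to an integer discards the fractional part for good.

```python
original = 3.75

# Lossless: Python's repr() of a float round-trips exactly through float().
as_text = repr(original)
from_text = float(as_text)

# Lossy: int() truncates towards zero, so the fraction cannot be recovered.
as_int = int(original)
from_int = float(as_int)

lossless = from_text == original
lossy = from_int != original
```

When a technique only accepts a certain data type, it is worth checking which kind of conversion is being applied, because lossy conversions cannot be undone later in the pipeline.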
Formats of data
● Structured data
  ● Relational database tables, integrated data warehouses
  ● Images, video, audio (can be compressed)
● Semi-structured data
  ● XML, HTML, e-mails, SMS, log files
● Free-format data
  ● Mainly free-format text – ASCII or Unicode
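One practical consequence of semi-structured formats is that fields may be absent from some records. The sketch below, which assumes a small made-up XML snippet, flattens XML elements into table-like rows with the standard library's xml.etree.ElementTree and shows how a missing element surfaces as a missing value.

```python
import xml.etree.ElementTree as ET

# Hypothetical semi-structured input: the second person has no <email>.
xml_text = """
<people>
  <person id="1"><name>Ann Lee</name><email>ann@example.org</email></person>
  <person id="2"><name>Bob Ray</name></person>
</people>
""".strip()

root = ET.fromstring(xml_text)

rows = []
for person in root.findall("person"):
    rows.append(
        {
            "id": person.get("id"),              # attribute value
            "name": person.findtext("name"),     # child element text
            "email": person.findtext("email"),   # None when the element is absent
        }
    )
```

Turning semi-structured data into a fixed set of columns like this is a typical extraction step, and it makes the missing values explicit so that they can be assessed and handled later.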
Data wrangling tools and resources (1)
● Data wrangling books (mostly specific to a certain language or tool)
  ● Data Wrangling with Python; Jacqueline Kazil and Katharine Jarmul, O’ , 2016
  ● Python for Data Analysis; Kinney, O’ , second edition, 2017
  ● Data Science from Scratch – First Principles with Python; , O’ , second edition, 2019
  ● Data Wrangling with R; , Springer, 2016
  ● R for Data Science – Import, Tidy, Transform, Visualize, and Model Data; and , O’ , 2017
● Some of these can be found as PDF files for download
Data wrangling tools and resources (2)
● Programming tools (mostly specific to a certain language or tool)
  ● Pandas (Python): http://pandas.pydata.org/
    A library that provides efficient data structures as well as data manipulation and analysis tools, including visualisation (we will show Pandas examples throughout the course)
  ● Matplotlib (Python): http://matplotlib.org
    A comprehensive 2D plotting library that produces high-quality output and supports interactive environments
  ● dplyr (R): https://cran.rstudio.com/web/packages/dplyr/vignettes/introduction.html
    Summarise and transform data in rows and columns
● Many more modules / packages are relevant to data wrangling
Data wrangling tools and resources (3)
● Software
  ● Rattle (R): http://rattle.togaware.com/
    A graphical user interface (GUI) on top of R that includes extensive data exploration, visualisation and transformation operations, developed by (previously Senior Data Miner at ATO), used in this course
  ● DataWrangler (now Trifacta Wrangler): https://www.trifacta.com/
    An interactive tool for data cleaning and transformation, developed in a Stanford/Berkeley research project, now commercial
  ● See also: https://blog.varonis.com/free-data-wrangling-tools/
● Many database and data warehouse systems include some data wrangling functionality