Parsing Raw Data
Faculty of Information Technology
Monash University, Australia
FIT5196 week 3
(Monash) FIT5196 1 / 18
Outline
1 Extracting Data From CSV files
2 Extracting Data From XML files
3 Extracting Data From JSON files
4 Extracting Data From PDF files
5 Summary
Data File Formats: Covered In This Unit
Easy-to-parse (machine-readable) formats:
• CSV: Comma Separated Values
• JSON: JavaScript Object Notation
• XML: eXtensible Markup Language
Hard-to-parse formats:
• Excel
• PDF: Portable Document Format
Data File Formats: Not Covered In This Unit
RDF: Resource Description Framework
• A standard model for data interchange on the Web.
• RDF has features that facilitate data merging even if the underlying schemas differ.
• RDF supports the evolution of schemas over time without requiring all the data consumers to be changed.
• RDF is well suited to storing graph data, such as knowledge graphs.
• Python libraries:
  − rdflib: http://rdflib.readthedocs.io/en/3.4.0/intro_to_graphs.html
HDF5: Hierarchical Data Format
• HDF5 has an internal, file-system-like node structure.
• HDF5 can store multiple datasets and supports metadata.
• HDF5 is a good choice for efficiently reading and writing large datasets.
• Python libraries:
  − PyTables
  − h5py
  − pandas.HDFStore()
Extracting Data From CSV files
CSV: Comma Separated Values
TSV: Tab Separated Values
Software that can open these files: Microsoft Excel, OpenOffice Calc, and Google Sheets.
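The only difference between CSV and TSV is the delimiter. A minimal sketch using Python's standard csv module illustrates this; the sample table and the helper name read_rows are hypothetical and used only for illustration:

```python
import csv
import io

# Hypothetical sample data: the same two-row table in both formats.
csv_text = "name,unit,score\nAlice,FIT5196,85\nBob,FIT5196,72\n"
tsv_text = csv_text.replace(",", "\t")


def read_rows(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


csv_rows = read_rows(csv_text, ",")
tsv_rows = read_rows(tsv_text, "\t")

print(csv_rows[0])
print(csv_rows == tsv_rows)  # both formats carry the same records
```

Note that the csv module returns every field as a string; converting "85" to a number is left to the caller.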
CSV: tools
Pandas functions for reading tabular data
• read_csv(): Reads delimited data from a file, URL, or file-like object. Uses a comma as the default delimiter.
• read_table(): Reads delimited data from a file, URL, or file-like object. Uses a tab as the default delimiter.
• read_fwf(): Reads data in fixed-width column format, i.e., with no delimiters.
• read_clipboard(): A version of read_table() that reads data from the clipboard. Useful for converting tables from web pages.
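As a rough sketch of how the first three functions behave, the snippet below reads the same small table from comma-delimited, tab-delimited, and fixed-width text. The sample data is hypothetical, and io.StringIO stands in for a real file; read_clipboard() is left out because it depends on the system clipboard.

```python
import io

import pandas as pd

# Hypothetical sample data used only for illustration.
csv_text = "name,score\nAlice,85\nBob,72\n"
tsv_text = "name\tscore\nAlice\t85\nBob\t72\n"
fwf_text = "name  score\nAlice    85\nBob      72\n"

df_csv = pd.read_csv(io.StringIO(csv_text))    # comma is the default delimiter
df_tsv = pd.read_table(io.StringIO(tsv_text))  # tab is the default delimiter
df_fwf = pd.read_fwf(io.StringIO(fwf_text))    # column boundaries are inferred

for df in (df_csv, df_tsv, df_fwf):
    print(df.to_dict("records"))
```

All three calls should produce the same two-row table, with the score column parsed as integers.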
Extracting Data From XML files
XML: Extensible Markup Language [1]
XML is a software- and hardware-independent tool for storing and transporting data.
• It simplifies data sharing and platform changes: there is no need to worry about exchanging data between incompatible systems.
• It simplifies data transport: XML stores data in plain-text format.
• It simplifies data availability: with XML, data can be available to all kinds of “reading machines”.
XML was designed to be both human- and machine-readable.
[1] Materials in the following 4 slides are based on http://www.w3schools.com/xml/default.asp
XML: DOM tree
According to the DOM (Document Object Model), everything in an XML document is a node.
The DOM says:
• The entire document is a document node
• Every XML element is an element node
• The text in an XML element is a text node
• Every attribute is an attribute node
• Comments are comment nodes
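The node types listed above can be inspected with Python's standard xml.dom.minidom module. This is a minimal sketch; the XML snippet is hypothetical and written without whitespace between tags so that no extra text nodes appear.

```python
from xml.dom import Node, minidom

# Hypothetical document used only for illustration.
xml_text = (
    '<book id="b1">'
    "<!-- a comment -->"
    "<title>Data Wrangling</title>"
    "</book>"
)

doc = minidom.parseString(xml_text)   # document node
book = doc.documentElement            # <book> element node
comment, title = book.childNodes      # comment node, then <title> element node
text = title.firstChild               # text node inside <title>
attr = book.getAttributeNode("id")    # attribute node for id="b1"

for label, node in [("document", doc), ("element", book), ("text", text),
                    ("attribute", attr), ("comment", comment)]:
    print(label, node.nodeType, node.nodeName)
```

Each node exposes a numeric nodeType that can be compared against the constants on xml.dom.Node, for example Node.ELEMENT_NODE.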
XML: tools
ElementTree
• https://docs.python.org/2/library/xml.etree.elementtree.html
• Python’s built-in XML parser
lxml
• http://lxml.de/
• Strong performance when parsing very large files
BeautifulSoup
• https://www.crummy.com/software/BeautifulSoup/bs4/doc/
• A Python library for pulling data out of HTML and XML files
• Works with your favourite parser, e.g., html.parser and lxml-xml
Demonstration with Jupyter notebook.
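Independently of the notebook, here is a minimal sketch of extracting records with ElementTree, the standard-library option above. The catalog document and its tag names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for illustration.
xml_text = """
<catalog>
  <book id="b1"><title>Python Basics</title><price>29.95</price></book>
  <book id="b2"><title>Data Wrangling</title><price>39.50</price></book>
</catalog>
""".strip()

root = ET.fromstring(xml_text)

# Turn each <book> element into a flat record.
records = [
    {
        "id": book.get("id"),                      # attribute value
        "title": book.findtext("title"),           # text of the child element
        "price": float(book.findtext("price")),    # text is a string; convert it
    }
    for book in root.iter("book")
]

for record in records:
    print(record)
```

A list of dictionaries like this can be passed directly to pandas.DataFrame() if a table is needed.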
Extracting Data From JSON files
JSON: JavaScript Object Notation
JSON: one of the most commonly used formats for transferring data
between web services and other applications via HTTP.
JSON is completely language independent but uses conventions that are
familiar to programmers of the C-family of languages.
JSON is built on two structures [2]:
• A collection of name/value pairs. In various languages, this is realised as an object, record, struct, dictionary, hash table, keyed list, or associative array.
• An ordered list of values. In most languages, this is realised as an array, vector, list, or sequence.
[2] Materials on the following 3 slides are based on http://www.json.org/
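In Python, the two structures map onto dict and list. A minimal sketch with the standard json module, using a hypothetical JSON string:

```python
import json

# Hypothetical JSON text used only for illustration.
json_text = '{"unit": "FIT5196", "week": 3, "topics": ["CSV", "XML", "JSON", "PDF"]}'

data = json.loads(json_text)

# The name/value collection becomes a dict; the ordered list becomes a list.
print(type(data).__name__, type(data["topics"]).__name__)

# Names are looked up by key, list items by position.
print(data["unit"], data["topics"][0])
```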
JSON: Structure
Three basic elements:
• Object: an unordered set of name/value pairs
• Array: an ordered collection of values
• Value: a string in double quotes, a number, true, false, null, an object, or an array
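To make the value types concrete, the sketch below parses one value of each kind with the standard json module and records the Python type each one becomes. The key names are hypothetical. It also checks that a string in single quotes is rejected, since JSON strings must use double quotes.

```python
import json

# Hypothetical JSON text containing one value of each kind.
json_text = '{"s": "text", "i": 3, "f": 2.5, "t": true, "n": null, "o": {"k": []}}'

value = json.loads(json_text)
python_types = {key: type(val).__name__ for key, val in value.items()}
print(python_types)

# Single-quoted strings are not valid JSON.
try:
    json.loads("{'s': 'text'}")
    single_quotes_accepted = True
except json.JSONDecodeError:
    single_quotes_accepted = False

print(single_quotes_accepted)
```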
JSON vs. XML
JSON: tools
json: a built-in Python library for parsing JSON.
pandas JSON functions:
• read_json(): Convert a JSON string to a pandas object
• json_normalize(): “Normalise” semi-structured JSON data into a flat table
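A minimal sketch of the two pandas functions, assuming a reasonably recent pandas release in which json_normalize is available as pd.json_normalize. The records are hypothetical, and io.StringIO stands in for a file.

```python
import io

import pandas as pd

# Flat records: read_json() maps them straight to rows and columns.
flat_json = '[{"name": "Alice", "score": 85}, {"name": "Bob", "score": 72}]'
df_flat = pd.read_json(io.StringIO(flat_json))

# Nested records: json_normalize() flattens inner objects into dotted columns.
nested = [
    {"name": "Alice", "contact": {"email": "alice@example.com", "city": "Melbourne"}},
    {"name": "Bob", "contact": {"email": "bob@example.com", "city": "Sydney"}},
]
df_nested = pd.json_normalize(nested)

print(df_flat)
print(sorted(df_nested.columns))
```

The nested "contact" object is expanded into separate columns named with a dot separator, such as contact.city.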
Extracting Data From PDF files
PDF: Portable Document Format
A file format used to present and exchange documents
• “Looks really do matter” (Adobe)
  − A PDF can contain text, images, links, buttons, form fields, audio, and video.
  − A PDF file encapsulates a complete description of the document's layout, e.g., fonts, graphics, and other meta information.
Not a data format: it describes how a document looks, not how its data is structured.
PDF: An example
PDF: Parsing Tools
pdfminer: A tool for extracting text, images, object coordinates, and metadata from PDF documents.
pdftable: A tool for extracting tables from PDF files; it uses pdfminer to get the locations of text elements.
slate: A small Python module that wraps pdfminer’s API.
Tabula: A simple tool for extracting data tables from PDF files.
Summary
Summary: what to do this week
1 Download, run and read the notebooks provided in Moodle, and also read
the recommended reading materials associated with each notebook.
2 Try to finish the exercises in each chapter, and post your findings and
experience in the discussion forum.
3 Attend tutorial 3 in the following week.
  • Parsing Excel files
4 Assessment 1 released