Parsing Raw Data

Faculty of Information Technology
Monash University, Australia

FIT5196 week 3

(Monash) FIT5196 1 / 18

Outline

1 Extracting Data From CSV files

2 Extracting Data From XML files

3 Extracting Data From JSON files

4 Extracting Data From PDF files

5 Summary

Data File Formats: Covered In This Unit

Easy-to-parse (machine-readable) formats:
• CSV: Comma Separated Values
• JSON: JavaScript Object Notation
• XML: eXtensible Markup Language

Hard-to-parse formats:
• Excel
• PDF: Portable Document Format

Data File Formats: Not Covered In this Unit

RDF: Resource Description Framework
• A standard model for data interchange on the Web.
• RDF has features that facilitate data merging even if the underlying schemas differ.
• RDF supports the evolution of schemas over time without requiring all the data consumers to be changed.
• RDF is ideal for storing graph data, such as knowledge graphs.
• Python libs

− rdflib: http://rdflib.readthedocs.io/en/3.4.0/intro_to_graphs.html

HDF5: Hierarchical Data Format
• HDF5 contains an internal file system-like node structure.
• HDF5 can store multiple datasets and supports metadata.
• HDF5 is a good choice for efficiently reading and writing large datasets.
• Python libs

− PyTables
− h5py
− pandas.HDFStore()

Extracting Data From CSV files

CSV: Comma Separated Values

TSV: Tab Separated Values

Software: Microsoft Excel, Open Office Calc, and Google Spreadsheets.
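
As a minimal sketch of how the two formats relate, the snippet below parses the same records once as CSV and once as TSV with Python's standard csv module. The sample data and the helper name parse_delimited are made up for illustration and are not part of the unit materials.

```python
import csv
import io

# Hypothetical sample data: the same two records in CSV and TSV form.
# In the CSV version, the field containing a comma has to be quoted.
CSV_TEXT = 'name,city\n"Smith, Jo",Melbourne\nLee,Sydney\n'
TSV_TEXT = "name\tcity\nSmith, Jo\tMelbourne\nLee\tSydney\n"


def parse_delimited(text, delimiter):
    """Parse delimited text into a list of dicts keyed by the header row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return [dict(row) for row in reader]


csv_rows = parse_delimited(CSV_TEXT, ",")
tsv_rows = parse_delimited(TSV_TEXT, "\t")

for row in csv_rows:
    print(row["name"], "lives in", row["city"])
```

Only the delimiter changes between the two calls; both produce the same list of records.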

CSV: tools

Pandas functions for reading tabular data
• read_csv(): Read delimited data from a file, URL, or file-like object. Uses a comma as the default delimiter.
• read_table(): Read delimited data from a file, URL, or file-like object. Uses a tab as the default delimiter.
• read_fwf(): Read data in fixed-width column format, i.e., no delimiters.
• read_clipboard(): Version of read_table() that reads data from the clipboard. Useful for converting tables from web pages.
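
The following sketch shows how three of these readers might be called on the same small table. It assumes pandas is installed; the sample data, the column names, and the column widths passed to read_fwf() are invented for illustration. read_clipboard() is left out because it depends on the contents of the system clipboard.

```python
import io

import pandas as pd

# Hypothetical sample data: the same table in three layouts.
csv_text = "id,name,score\n1,Ann,90\n2,Bob,85\n"
tsv_text = "id\tname\tscore\n1\tAnn\t90\n2\tBob\t85\n"
fwf_text = "id name score\n1  Ann  90\n2  Bob  85\n"

# read_csv() splits on commas by default.
df_csv = pd.read_csv(io.StringIO(csv_text))

# read_table() splits on tabs by default.
df_tsv = pd.read_table(io.StringIO(tsv_text))

# read_fwf() slices each line by character position; here the assumed
# column widths are 3, 5 and 5 characters.
df_fwf = pd.read_fwf(io.StringIO(fwf_text), widths=[3, 5, 5])

print(df_csv)
```

io.StringIO stands in for a file here; each function also accepts a file path or URL in its place.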

Extracting Data From XML files

XML: Extensible Markup Language [1]

XML is a software- and hardware-independent tool for storing and transporting data.
• It simplifies data sharing and platform changes: no need to worry about issues of exchanging data between incompatible systems.
• It simplifies data transport: XML stores data in plain text format.
• It simplifies data availability: with XML, data can be available to all kinds of “reading machines”.

XML was designed to be both human- and machine-readable.

[1] Materials in the following 4 slides are based on http://www.w3schools.com/xml/default.asp

XML: DOM tree

According to the DOM (Document Object Model), everything in an XML document is a node.
The DOM says:
• The entire document is a document node
• Every XML element is an element node
• The text in an XML element is a text node
• Every attribute is an attribute node
• Comments are comment nodes
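
The node kinds listed above can be inspected with Python's standard xml.dom.minidom module. This is a minimal sketch; the XML snippet and variable names are made up for illustration.

```python
from xml.dom import Node, minidom

# Hypothetical XML snippet containing an element with an attribute,
# a comment, and a child element that holds text.
XML_TEXT = '<book id="b1"><!-- a comment --><title>Data Wrangling</title></book>'

doc = minidom.parseString(XML_TEXT)
book = doc.documentElement                      # the root <book> element
title = book.getElementsByTagName("title")[0]   # the <title> element

# Check each node against the node-type constant the DOM assigns to it.
node_types = {
    "document": doc.nodeType == Node.DOCUMENT_NODE,
    "element": book.nodeType == Node.ELEMENT_NODE,
    "attribute": book.getAttributeNode("id").nodeType == Node.ATTRIBUTE_NODE,
    "comment": book.firstChild.nodeType == Node.COMMENT_NODE,
    "text": title.firstChild.nodeType == Node.TEXT_NODE,
}

# The text of <title> is stored in its child text node, not on the element.
title_text = title.firstChild.data
print(node_types, title_text)
```

Note that the text "Data Wrangling" is not a property of the title element itself; it lives in a separate text node that is the element's child.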

XML: tools

ElementTree
• https://docs.python.org/2/library/xml.etree.elementtree.html
• Python’s built-in XML parser.

lxml
• http://lxml.de/
• Strong performance in parsing very large files

BeautifulSoup
• https://www.crummy.com/software/BeautifulSoup/bs4/doc/
• A Python library for pulling data out of HTML and XML files
• Works with your favourite parser, e.g., html.parser and lxml-xml

Demonstration with Jupyter notebook.
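
As a rough sketch of what such a demonstration might cover, the snippet below uses the built-in ElementTree parser to turn a small XML document into a list of records. The XML content, tag names, and variable names are assumptions made up for illustration, not taken from the unit's notebook.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document, invented for this example.
XML_TEXT = """
<library>
  <book id="b1"><title>Data Wrangling</title><year>2016</year></book>
  <book id="b2"><title>Python Basics</title><year>2018</year></book>
</library>
""".strip()

root = ET.fromstring(XML_TEXT)

# Walk every <book> element and collect its attribute and child text.
records = [
    {
        "id": book.get("id"),
        "title": book.findtext("title"),
        "year": int(book.findtext("year")),
    }
    for book in root.iter("book")
]

for record in records:
    print(record)
```

lxml.etree offers a largely compatible interface, so a similar loop would likely work there with a changed import, although that is not verified here.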
