COMP20008
Elements of Data Processing
Semester 1 2021
Lecture 3 Part I – Part IV: Data Formats
© University of Melbourne 2021
Data Formats
Examples of Data Formats
Unstructured           Semi-Structured    Structured
Text files/documents   XML                Databases
Audio                  JSON               Tables
Video                  Webpages           Spreadsheets
Social media data      CSV, NoSQL, …
Axis labels: More Machine Readable / More Human Readable
Structured data
Relational databases
Relational Database
https://clockwise.software/blog/relational-vs-non-relational-databases-advantages-and-disadvantages/
Relational Database
Major table
ID   Major
0    Linguistics
1    Commerce
…    …

Supervisor table
ID   SupervisorName
0    Prof. Tim Baldwin
1    Prof. Trevor Cohn
…    …

Flat file
StudentID   Grade   Supervisor          Major
20201001    H1      Prof. Tim Baldwin   Linguistics
20193032    H2A     Prof. Trevor Cohn   Linguistics
20195309    H2B     Prof. Tim Baldwin   Commerce
…           …       …                   …

Grade table (relational version of the flat file)
StudentID   Grade   SupervisorID   MajorID
20201001    H1      0              0
20193032    H2A     1              0
20195309    H2B     0              1
…           …       …              …
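The step from the flat file to the linked tables can be sketched in plain Python. This is only an illustration using the three rows visible on the slide; the helper name build_lookup is hypothetical, and IDs are simply assigned in order of first appearance.

```python
# Rows of the flat file (only the three rows visible on the slide).
flat_rows = [
    {"StudentID": 20201001, "Grade": "H1", "Supervisor": "Prof. Tim Baldwin", "Major": "Linguistics"},
    {"StudentID": 20193032, "Grade": "H2A", "Supervisor": "Prof. Trevor Cohn", "Major": "Linguistics"},
    {"StudentID": 20195309, "Grade": "H2B", "Supervisor": "Prof. Tim Baldwin", "Major": "Commerce"},
]


def build_lookup(rows, column):
    """Assign an integer ID to each distinct value, in order of first appearance."""
    lookup = {}
    for row in rows:
        lookup.setdefault(row[column], len(lookup))
    return lookup


supervisor_ids = build_lookup(flat_rows, "Supervisor")
major_ids = build_lookup(flat_rows, "Major")

# Replace the repeated strings with IDs that point into the lookup tables.
grade_table = [
    {
        "StudentID": row["StudentID"],
        "Grade": row["Grade"],
        "SupervisorID": supervisor_ids[row["Supervisor"]],
        "MajorID": major_ids[row["Major"]],
    }
    for row in flat_rows
]

for row in grade_table:
    print(row)
```

Each supervisor and major name is now stored once, so a correction (say, a changed title) touches one row instead of every student record.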
Relational Database
The major, supervisor and grade tables shown earlier are held in a DBMS (database management system) and queried with SQL.
SQL – Structured Query Language
SELECT StudentID, Grade
FROM grade_table, supervisor_table
WHERE grade_table.SupervisorID = supervisor_table.ID
  AND supervisor_table.SupervisorName = 'Prof. Tim Baldwin';
Here grade_table and supervisor_table are the grade and supervisor tables shown earlier.
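To try the query, the tables can be loaded into an in-memory SQLite database with Python's standard sqlite3 module. This is a minimal sketch: the column types are assumptions, the filter column is taken to be SupervisorName as in the table header, and an ORDER BY is added only to make the result order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
    CREATE TABLE supervisor_table (ID INTEGER PRIMARY KEY, SupervisorName TEXT);
    CREATE TABLE grade_table (
        StudentID INTEGER PRIMARY KEY,
        Grade TEXT,
        SupervisorID INTEGER REFERENCES supervisor_table(ID),
        MajorID INTEGER
    );
    INSERT INTO supervisor_table VALUES (0, 'Prof. Tim Baldwin'), (1, 'Prof. Trevor Cohn');
    INSERT INTO grade_table VALUES
        (20201001, 'H1', 0, 0),
        (20193032, 'H2A', 1, 0),
        (20195309, 'H2B', 0, 1);
""")

# Join the two tables on the supervisor ID, then filter by supervisor name.
query = """
    SELECT StudentID, Grade
    FROM grade_table, supervisor_table
    WHERE grade_table.SupervisorID = supervisor_table.ID
      AND supervisor_table.SupervisorName = ?
    ORDER BY StudentID
"""
rows = conn.execute(query, ("Prof. Tim Baldwin",)).fetchall()
print(rows)
conn.close()
```

Passing the name as a parameter (the ? placeholder) rather than pasting it into the SQL string avoids quoting problems.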
Database Systems – (INFO20003)
• INFO20003 covers related topics including
  • SQL
  • Specification of integrity constraints
  • Data modelling and relational database management systems
  • Transactions and concurrency control
  • Storage management
  • Web-based databases
• Highly relevant to data wrangling!
• Useful to do INFO20003 as part of a data science specialisation
Relational Algebra
https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d
Joins in Python Pandas
import pandas as pd
pd.merge()
• on
• left_on, right_on
• how
  • inner
  • outer
  • left
  • right
[Figure: diagrams of inner JOIN, left JOIN, right JOIN and outer JOIN]
Image from https://stackoverflow.com/questions/53645882/
Resources
• https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
• https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html
• https://pandas.pydata.org/pandas-docs/version/0.22.0/merging.html
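The four values of how can be compared on the grade and supervisor tables from earlier in the lecture. In this sketch one student (20210000) and one supervisor (Prof. Example) are made up, so that each table has a row without a match in the other; everything else comes from the slides.

```python
import pandas as pd

grades = pd.DataFrame({
    "StudentID": [20201001, 20193032, 20195309, 20210000],
    "Grade": ["H1", "H2A", "H2B", "H3"],
    "SupervisorID": [0, 1, 0, 2],  # last row is made up: ID 2 has no supervisor row
})
supervisors = pd.DataFrame({
    "ID": [0, 1, 3],  # ID 3 is made up: a supervisor with no students
    "SupervisorName": ["Prof. Tim Baldwin", "Prof. Trevor Cohn", "Prof. Example"],
})

row_counts = {}
for how in ["inner", "left", "right", "outer"]:
    # The key columns have different names, so left_on/right_on are used instead of on.
    joined = pd.merge(grades, supervisors, left_on="SupervisorID", right_on="ID", how=how)
    row_counts[how] = len(joined)
    print(how, len(joined))
```

The inner join keeps only matched rows, left and right each keep every row of one side, and outer keeps every row of both; unmatched cells are filled with NaN.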
Challenges
• Once data is in a relational database, it is easier to wrangle.
• But it may be difficult to load it there in the first place …
More structured data – Spreadsheets
• Huge amounts of data are in spreadsheets
  • Businesses
  • Hospitals
  • …
• Microsoft (Excel), OpenOffice (Calc), Google Sheets
Break
Semi-structured Data
CSV: comma separated values
• Tabular information, with extension .csv
• Structured, but not like Excel or a relational DB
• Just a delimited text file, human readable.
• Lacks formatting information
• Does not contain formulas or macros for data verification and transformation
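A short sketch with Python's standard csv module shows what "just a delimited text file" means in practice. The file contents below reuse the flat-file rows from earlier in the lecture and are held in a string so the example is self-contained; reading from a real .csv file would use open() instead.

```python
import csv
import io

# Stand-in for the contents of a .csv file.
csv_text = """StudentID,Grade,Supervisor,Major
20201001,H1,Prof. Tim Baldwin,Linguistics
20193032,H2A,Prof. Trevor Cohn,Linguistics
20195309,H2B,Prof. Tim Baldwin,Commerce
"""

reader = csv.DictReader(io.StringIO(csv_text))  # first line supplies the column names
rows = list(reader)

# Every field arrives as a string: the file carries no type or formatting information.
print(rows[0]["StudentID"], type(rows[0]["StudentID"]))

# Any conversion has to be done explicitly by whoever reads the file.
student_ids = [int(row["StudentID"]) for row in rows]
print(student_ids)
```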
HTML – HyperText Markup Language
• Marked up with elements, which correspond to logical units
  • e.g. a heading, paragraph or itemised list
• Defines how a web browser will format and display the content
• Elements are marked by tags.
• Tags: keywords contained in pairs of angle brackets, not case sensitive
  • Closed tags come in pairs, e.g. <p> … </p>
  • Unclosed tags stand alone, e.g. <br>
• Elements can have attributes; ordering of attributes is not significant.
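These points can be seen with Python's standard html.parser module. The fragment below is made up for illustration (mixed-case tags, one unclosed tag, one element with two attributes), and the class name TagLogger is hypothetical.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Record every start tag with its attributes, and every end tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []
        self.end_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict ignores their order.
        self.start_tags.append((tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.end_tags.append(tag)


html_text = (
    '<H1>My hero</H1>'
    '<p>First line<br>second line</p>'
    '<a href="https://example.com" id="x">link</a>'
)

parser = TagLogger()
parser.feed(html_text)
parser.close()

print(parser.start_tags)
print(parser.end_tags)
```

The parser reports H1 as h1 (tag names are not case sensitive), and br appears among the start tags but never among the end tags.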
HTML Example
Try it yourself: https://www.w3schools.com/html/tryit.asp?filename=tryhtml5_browsers_myhero
HTML examples: https://www.w3schools.com/html/html_lists.asp
Limitations of HTML
• HTML was designed for pure presentation
• HTML is concerned with formatting, not meaning
  • it doesn’t matter what the content is about, HTML will format it
• HTML is not extensible
  • can’t be modified to meet specific domain knowledge
  • browsers have developed their own tags
• HTML can be inconsistently applied
  • almost everything is rendered somehow
  • e.g. is this acceptable?
XML: eXtensible Markup Language
• Extensible: user defined tags
• Facilitates better encoding of semantics
Subject guide in plain-text
Year of offer = 2019
Subject level = Undergraduate level 2
Subject code = COMP20008
Campus = Parkville
Availability = Semester 1, Semester 2
Subject guide in HTML – cont.
Subject guide in XML
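One possible XML encoding of the plain-text subject guide can be built with Python's standard xml.etree.ElementTree module. The element and attribute names (subject, code, year, level, campus, availability, semester) are assumptions made for this sketch, not a published schema.

```python
import xml.etree.ElementTree as ET

# Element and attribute names are illustrative assumptions.
subject = ET.Element("subject", attrib={"code": "COMP20008"})
ET.SubElement(subject, "year").text = "2019"
ET.SubElement(subject, "level").text = "Undergraduate level 2"
ET.SubElement(subject, "campus").text = "Parkville"
availability = ET.SubElement(subject, "availability")
for name in ["Semester 1", "Semester 2"]:
    ET.SubElement(availability, "semester").text = name

xml_text = ET.tostring(subject, encoding="unicode")
print(xml_text)

# Because the tags carry meaning, a program can ask for a field by name.
parsed = ET.fromstring(xml_text)
semesters = [s.text for s in parsed.iter("semester")]
print(parsed.get("code"), parsed.findtext("campus"), semesters)
```

Unlike the HTML version, nothing here says how the guide should look; the markup only says what each piece of text is.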
XML syntax – well formed
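• Begins with a declaration, the XML prolog, e.g. <?xml version="1.0" encoding="UTF-8"?>
• Elements
  • One root element
  • Properly nested
  • Attribute values must be quoted
  • Must have a closing tag, e.g. <p> … </p>, or be self-closing, e.g. <br/>
• Comments: <!-- … -->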
XML syntax – cont.
• Preserves white space.
• ‘<’ and ‘&’ are strictly illegal inside an element
  • use the entity references &lt; and &amp; instead
• A CDATA (character data) section may be used inside an XML element to include large blocks of text that contain special characters such as &, < and >
  • <![CDATA[ … ]]>
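The well-formedness rules can be checked with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree and xml.sax.saxutils; the helper is_well_formed and the sample strings are made up for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def is_well_formed(xml_text):
    """Return True if the parser accepts the text, False on a syntax error."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


samples = {
    "nested": "<note><to>Ann</to></note>",                 # properly nested
    "crossed": "<note><to>Ann</note></to>",                # tags overlap
    "bare_amp": "<note>fish & chips</note>",               # illegal bare '&'
    "entity": "<note>fish &amp; chips</note>",             # entity reference
    "cdata": "<note><![CDATA[a < b && b > c]]></note>",    # CDATA section
}
results = {name: is_well_formed(text) for name, text in samples.items()}
print(results)

# escape() rewrites &, < and > as entity references.
escaped = escape("a < b & c")
print(escaped)
```

Unlike a browser rendering HTML, an XML parser rejects the document outright as soon as one rule is broken.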
XML applications
• A ‘meta’ mark-up language.
• Mathematical Markup Language (MathML)
• ChemML (Chemical Markup Language)
• FHIR (Health/Medical data: http://hl7.org/fhir)
• RSS, SOAP, SVG, …
MathML example: markup an equation
In MathML, x³ + 6x + 6 is represented as
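A rough sketch of presentation MathML for this expression, generated with Python's standard xml.etree.ElementTree. It assumes the expression is x cubed plus 6x plus 6 (the superscript is not visible in the text), and it is one plausible encoding rather than necessarily the markup on the original slide.

```python
import xml.etree.ElementTree as ET

math_el = ET.Element("math", attrib={"xmlns": "http://www.w3.org/1998/Math/MathML"})
row = ET.SubElement(math_el, "mrow")

# x^3: msup takes a base followed by a superscript
power = ET.SubElement(row, "msup")
ET.SubElement(power, "mi").text = "x"   # mi = identifier
ET.SubElement(power, "mn").text = "3"   # mn = number

ET.SubElement(row, "mo").text = "+"     # mo = operator

# 6x: a number followed by an identifier
ET.SubElement(row, "mn").text = "6"
ET.SubElement(row, "mi").text = "x"

ET.SubElement(row, "mo").text = "+"
ET.SubElement(row, "mn").text = "6"

mathml = ET.tostring(math_el, encoding="unicode")
print(mathml)
```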
XML vs HTML
• XML is extensible; HTML is non-extensible
• XML is case sensitive; HTML is not case sensitive
• XML focuses on semantics; HTML focuses on display
Break
Semi-structured Data – JSON
JSON
JavaScript Object Notation (JSON)
• JSON (www.json.org)
• Douglas Crockford (pretty much alone)
  • cf. the development of XML by committee
• “JavaScript: The Good Parts”
  • O’Reilly, Yahoo Press
JSON syntax rules
• Object data is in name/value pairs: "firstName":"John"
• JSON values
  • A number (integer or floating point)
  • A string (in double quotes)
  • A Boolean (true or false)
  • An array (in square brackets)
  • An object (in curly braces)
  • null
JSON syntax rules
• JSON Objects
{"firstName":"John", "lastName":"Doe"}
• JSON Arrays
"employees":[
    {"firstName":"John", "lastName":"Doe"},
    {"firstName":"Anna", "lastName":"Smith"},
    {"firstName":"Peter", "lastName":"Jones"}
]
• These objects repeat recursively down a hierarchy as needed.
• In terms of syntax that’s pretty much it!
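Python's standard json module maps these rules directly onto Python types. In the sketch below the employees array is wrapped in an outer object, since a name/value pair on its own is not a complete JSON document; the second string (age, height, active, manager) is made up to exercise the remaining value types.

```python
import json

text = """
{"employees": [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna", "lastName": "Smith"},
    {"firstName": "Peter", "lastName": "Jones"}
]}
"""

data = json.loads(text)  # JSON object -> dict, JSON array -> list
last_names = [e["lastName"] for e in data["employees"]]
print(last_names)

# Numbers, Booleans and null map to int/float, bool and None (values are made up).
values = json.loads('{"age": 30, "height": 1.8, "active": true, "manager": null}')
print(values)

# Going the other way: Python data -> JSON text.
first = json.dumps(data["employees"][0])
print(first)
```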
JSON format (from json.org)
Verbosity
https://www.w3schools.com/js/js_json_xml.asp
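The difference in verbosity can be measured by serialising the same records both ways. This sketch uses the employees data from the JSON syntax slide; the XML element names (employees, employee) are an assumption, and other XML designs (for example, attributes instead of child elements) would give different lengths.

```python
import json
import xml.etree.ElementTree as ET

employees = [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna", "lastName": "Smith"},
    {"firstName": "Peter", "lastName": "Jones"},
]

# Compact JSON: no spaces after the separators.
json_text = json.dumps({"employees": employees}, separators=(",", ":"))

# One possible XML encoding: one child element per field.
root = ET.Element("employees")
for person in employees:
    node = ET.SubElement(root, "employee")
    for key, value in person.items():
        ET.SubElement(node, key).text = value
xml_text = ET.tostring(root, encoding="unicode")

print(len(json_text), json_text)
print(len(xml_text), xml_text)
```

In the XML version every field name appears twice, once in the opening tag and once in the closing tag, which accounts for most of the extra characters.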
Python libraries for JSON and XML
• json
• lxml
XML, JSON, CSV and HTML conversion tools
JSON compared to XML
• JSON is simpler and more compact/lightweight than XML, and easy to parse.
  • This appeals to programmers looking for speed and efficiency
• Widely used for storing data in NoSQL databases
• Common JSON application: read and display data from a web server using JavaScript.
  https://www.w3schools.com/js/js_json.asp
• XML comes with a large family of other standards for querying and transforming (XQuery, XML Schema, XPath, XSLT, namespaces, …)
  • allows formal validation
  • makes you consider the data design more closely
JSON: Summary
• JavaScript Object Notation
• Lightweight, streamlined, standard method of data exchange
• Originally designed to speed up client/server interactions
  • By running in the client browser
• Can be used to represent any kind of semi-structured data
• Lacks context and schema definitions
Break
Semi-structured Data – Cont.
PDF
PDF – Portable Document Format
• A format for presenting documents independently of application software / OS
• Introduced by Adobe in 1993, standardized in 2008
• Large amounts of useful data are stored in PDF format (e.g. forms, invoices, etc.), legacy data in particular.
Challenges of PDF format
• Format is very flexible, with limited consistency
• May contain different types of data; similar-looking documents can be represented very differently
• Text may be stored as images, e.g., for scanned documents
• Data is unstructured
• As a result, automated extraction is difficult
• May require substantial preprocessing to extract usable data
Approaches to PDF data
Convert the PDF to text using a library (e.g. poppler, PDFMiner):
pdftotext nvsr65_05.pdf nvsr65_05.txt
National Vital Statistics Reports
Volume 65, Number 5
June 30, 2016
Deaths: Leading Causes for 2014
by Melonie Heron, Ph.