COMP20008
Elements of data processing
Semester 1 2020
Lecture 3: Data formats
Student representatives
Lucas Fern lfern@student.unimelb.edu.au Sanjog Gururaj sgururaj@student.unimelb.edu.au
Please contact them with any feedback and suggestions you have about this subject
Changes to assessments
– Assignment one will be worth 25%
– Assignment two also 25%
– No oral presentation (originally worth 10 %)
– Final exam remains 50%
– Hurdle requirements unchanged – 20/50 for assignments
– 20/50 for exam
A data scenario
Next week we are going to make a presentation to the CEO and we need to include a profile of all our customers: who they are, their past purchasing behaviour and all the types of interactions they’ve had with our company.
The data exists, kind of, but it is spread across multiple systems and is in many types of formats: CSV, XML, JSON, HTML, spreadsheets, ….
Data formats
• How is data stored and in what formats? – Structured: Relational databases, excel, …
– Unstructured: text, images
– Semi-structured: CSV,HTML, XML, JSON, …
• Why do we have different data formats
• Why do we wish to transform between different formats?
Structured data – Relational databases
• It is good to have structure for data!
• Easier to analyse, easier to query, easier to store
• Easier to clean, maintain consistency and security, especially with multiple
users
2/03/2018 7:17 05am
Attribute
Tuple
Relation
• Relational databases, the classic method of storing structured data (banking, sales, airlines …), as a table
• Can query the data using a high-level language such as SQL
Example – a relation table
Attributes
https://www.sites.google.com/site/merasemester/dbm/relational-model?tmpl=%2Fsystem%2Fapp%2Ftemplates%2Fprint%2F&showPrintDialog=1
Sample relational database
https://web.csulb.edu/colleges/coe/cecs/dbdesign/dbdesign.php?page=manymany.php
SQL – a language used for relational databases
create table branch
(branch_name char(15) not null,
branch_city char(30),
assets integer, primary key (branch_name))
SQL
select from where
account.balance
depositor, account
depositor.customer_id = ʻ192-83-7465ʼ and
depositor.account_number = account.account_number
DBMS – Database management system
http://www.expertsmind.com/questions/physical-dbms-architecture-30139088.aspx
Database systems – (INFO20003)
• INFO20003 covers related topics including • SQL
• Specification of integrity constraints
• Data modelling and relational database management systems • Transactions and concurrency control
• Storage management
• Web-based databases
• Highly relevant to data wrangling!
• Useful to do INFO20003 as part of a data science specialisation
Challenges
• Once data is into a relational database, it is easier to wrangle. • But may be difficult to load it there in the first place …
More structure – Spreadsheets
• Huge amounts of data is in spreadsheets • Businesses
• Hospitals • ….
• Microsoft (Excel), OpenOffice (Calc), Google docs
Information in tabular format (transactional, simple but many entries)
Unstructured data – Text
Text files…
• No structure
• Harder to index
• Harder to organise
• Lacks regularity and decomposable internal structure
• How can we process and search for textual information?
More on text data next week.
Semi-structured data
CSV: comma separated values
• Also very popular
• Also stores tabular information
• Just a delimited text file with extension .csv
• human readable, versus binary XLS format (Excel) • Manipulate with any text editor
• Lacks formatting information
• Does not contain formulas and macros (data verification, transformation)
Example – CSV
Spreadsheets Easy to use
Structured, but not
like excel or a relational DB
HTML – Hypertext Markup language
• Marked up with elements, delineated by start and end tags.
• Elements correspond to logical units, such as a heading, paragraph or
itemised list.
• Tags: Keywords contained in pairs of angle brackets.
• Not case sensitive.
• Browser determines how to display/present the logical units
• Not all elements need both start and end tags.
• Elements can have attributes; ordering of attributes is not significant.
HTML Example
About the Melbourne School of Engineering
- Dean’s Welcome
- Leadership & Professional Staff
- Contact Us
- ECR: Computer Resources
- For Staff (intranet)
- For Casual Staff
- Professional Staff Review
- Environment, Health & Safety
- Committees
Try it yourself: https://www.w3schools.com/html/tryit.asp?filename=tryhtml5_browsers_myhero HTML examples: https://www.w3schools.com/html/html_lists.asp
Limitations of HTML
• HTML was designed for pure presentation
• HTML is concerned with formatting not meaning
it doesn’t matter what it is about, HTML will format it
• HTML is not extensible
• can’t be modified to meet specific domain knowledge
• browsers have developed their own tags (
• HTML can be inconsistently applied almost everything is rendered somehow e.g. is this acceptable?
XML: eXtensible Markup Language
• A ‘meta’ mark-up language
• Extensible: user defined tags
• Facilitate better encoding of semantics
Subject guide in plain-text
Year of offer = 2019
Subject level = Undergraduate level 2 Subject code = COMP20008
Campus = Parkville
Availability = Semester 1, Semester 2
Subject guide in HTML
Subject guide in HTML – cont.
Subject guide in HTML – cont.
Subject guide in xml
XML syntax – well formed
• xml files must begin with declaration
• xml elements
• One root element
• Appropriately nested
• Start/end tags or Empty tags • Attributes in quotes
• comments
XML syntax – cont.
• some characters have special meaning
• ‘<’ and ‘&’ are strictly illegal inside an element
•
•
• CDATA (character data) section may be used inside XML element to include large blocks of text, which may contain these special characters such as &, >
•
•
&
Using HTML/XML (Python)
• For HTML scraping, the Beautiful Soup library is good
• For XML, a good library is http://lxml.de/
• Import the XML file into your program as a tree structure: import xml.etree.ElementTree as ET
tree = ET.parse(‘yourfile.xml’)
root = tree.getroot()
• Then loop through root with the various methods available: for child in root:
print child.tag, child.attrib
XML applications
• Mathematical Markup Language (MathML) • ChemML (Chemical Markup Language)
• RSS, SOAP, SVG, …
MathML example: markup an equation in terms of presentation and semantics
In MathML, x3+6x+6 is represented as
JavaScript Object Notation (JSON)
• JSON (www.json.org)
• Douglas Crockford (pretty much alone)
• c.f the development of XML by committee
• “Javascript: the good parts” • O’ Reilly, Yahoo Press
JSON
{ “Catalog”: [
{ “CD”: {
“title”: “Empire Burlesque”, “artist”: “Bon Dylan”,
“price”: {
”currency”: “USD”,
“value”: 10.90 },
“year”: 1985 }
},
{ “CD”: {
“title”: “Hide your heart”, “artist”: “Bonnie Taylor”, “price”: {
“currency”: “USD”, “value”: 9.90
},
“year”: 1988 }
} ]
}
JSON object example (cont. next slide)
{
“firstName”: “David”, “lastName”: “Lynn”, “isAlive”: true,
“age”: 25, “height_cm”: 167.6, “address”: {
“streetAddress”: “211 Fox Street”, “city”: “Greenville”,
“state”: “NH”,
“postalCode”: “80021”
},
JSON example (cont.)
“phoneNumbers”: [
{
“type”: “home”,
“number”: “315 555-1812” },
{
“type”: “office”,
“number”: “646 555-4567”
], }
“email”: “dlynn@nhs.net”
}
XML representation
….
JSON syntax rules
• Object data is in name/value pairs “firstName”:”John”
• JSON values
•A number (integer or floating point)
•A string (in double quotes) •A Boolean (true or false)
•An array (in square brackets) •An object (in curly braces) •null
JSON syntax rules
• JSON Objects
{“firstName”:”John”, “lastName”:”Doe”}
• JSON Arrays “employees”:[
{“firstName”:”John”, “lastName”:”Doe”}, {“firstName”:”Anna”, “lastName”:”Smith”}, {“firstName”:”Peter”, “lastName”:”Jones”}
]
• These objects repeat recursively down a hierarchy as needed. • In terms of syntax that’s pretty much it!
Using JSON (Python)
import json
json.loads( ‘[“foo”,
{“bar”:
[“baz”, null, 1.0, 2]
} ]’)
json.dumps( [‘foo’,
{‘bar’:
(‘baz’, None, 1.0, 2)
} ]
– Load JSON format to Python Dictionary
)
– Convert to JSON format from Python Dictionary
Note: white space and indentation is for human purposes only!
JSON format (from json.org)
JSON format (json.org)
JSON compared to XML
• JSON is simpler and more compact/lightweight than XML; easy to parse.
• Common JSON application – read and display data from a webserver using javascript.
• https://www.w3schools.com/js/js_json.asp
• XML comes with a large family of other standards for querying and transforming (XQuery, XML Schema, XPATH, XSLT, namespaces, …)
JSON vs XML cont.
• XML allows complex schema definitions (via regular expressions) • allows formal validation
• makes you consider the data design more closely
• JSON is more streamlined, lightweight and compressed
• Which appeals to programmers looking for speed and efficiency • Widely used for storing data in noSQL databases
Exercise
Represent the following information in JSON
Check it is well formed: http://jsonlint.com
JSON: Summary
• JavaScript Object Notation
• Lightweight, streamlined, standard method of data exchange
• Originally designed to speed up client/server interactions: • By running in the client browser
• Native Javascript, so can be executed as code
• Can be used to represent any kind of semi structured data • Lacks context and schema definitions
Python libraries for JSON and XML
• json • lxml
XML, JSON, CSV and HTML conversion tools
What you should know
• Why do we have different data formats and why do we wish to transform between different formats?
• Motivation for using relational databases to manage information
• What is a csv, what is a spreadsheet, what is the difference?
• Difference between HTML and XML and when to use each
• Be able to read and write data in XML (elements, attributes)
• Be able to read and write data in JSON
• Difference between XML and JSON; applications where each can
be used.
Further reading
• Relational databases
Pages 403-409 of http://i.stanford.edu/~ullman/focs/ch08.pdf
• XML http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SG.html