COMP20008
Elements of Data Processing
Semester 1 2021
Lecture 3 Part I – Part IV: Data Formats
© University of Melbourne 2021

Data Formats

Examples of Data Formats
Unstructured            Semi-Structured     Structured
Text files/documents    XML                 Databases
Audio                   JSON                Tables
Video                   Webpages            Spreadsheets
Social media data       CSV, NoSQL, …

Unstructured formats are more human readable; structured formats are more machine readable.

Structured data
Relational databases

Relational Database
https://clockwise.software/blog/relational-vs-non-relational-databases-advantages-and-disadvantages/

Relational Database
Flat file

StudentID   Grade   Supervisor          Major
20201001    H1      Prof. Tim Baldwin   Linguistics
20193032    H2A     Prof. Trevor Cohn   Linguistics
20195309    H2B     Prof. Tim Baldwin   Commerce

Relational tables

StudentID   Grade   SupervisorID   MajorID
20201001    H1      0              0
20193032    H2A     1              0
20195309    H2B     0              1

ID   SupervisorName
0    Prof. Tim Baldwin
1    Prof. Trevor Cohn

ID   Major
0    Linguistics
1    Commerce
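The step from the flat file to the relational tables can be sketched in pandas. This is only an illustration: the helper name split_lookup and the variable names are assumptions, and pd.factorize is used here simply because it numbers values in order of first appearance, which happens to reproduce the IDs in the tables above.

```python
import pandas as pd

# The flat file from the slide, one row per student.
flat = pd.DataFrame({
    "StudentID": [20201001, 20193032, 20195309],
    "Grade": ["H1", "H2A", "H2B"],
    "Supervisor": ["Prof. Tim Baldwin", "Prof. Trevor Cohn", "Prof. Tim Baldwin"],
    "Major": ["Linguistics", "Linguistics", "Commerce"],
})


def split_lookup(df, column, name_column):
    """Replace a repeated text column by integer IDs plus a lookup table (hypothetical helper)."""
    codes, uniques = pd.factorize(df[column])  # IDs follow order of first appearance
    lookup = pd.DataFrame({"ID": range(len(uniques)), name_column: uniques})
    return codes, lookup


supervisor_ids, supervisor_table = split_lookup(flat, "Supervisor", "SupervisorName")
major_ids, major_table = split_lookup(flat, "Major", "Major")

# The main table keeps only the IDs that point into the lookup tables.
grade_table = flat[["StudentID", "Grade"]].assign(
    SupervisorID=supervisor_ids, MajorID=major_ids
)

print(grade_table)
print(supervisor_table)
print(major_table)
```

Each supervisor and major name is now stored once, so a spelling correction only has to be made in one row.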

Relational Database
The same three tables, now stored in a database management system (DBMS) and queried with SQL.

SQL – Structured Query Language
Select StudentID, Grade
from grade_table, supervisor_table
where grade_table.SupervisorID = supervisor_table.ID
and SupervisorName = 'Prof. Tim Baldwin'

Here grade_table holds StudentID, Grade, SupervisorID and MajorID, and supervisor_table holds ID and SupervisorName (the tables shown above).
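One way to try the query is with Python's built-in sqlite3 module and an in-memory database. This is a sketch under assumptions: the column types are guesses, and SQLite stands in for whichever DBMS is actually used.

```python
import sqlite3

# In-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE supervisor_table (ID INTEGER PRIMARY KEY, SupervisorName TEXT);
CREATE TABLE grade_table (
    StudentID INTEGER PRIMARY KEY,
    Grade TEXT,
    SupervisorID INTEGER REFERENCES supervisor_table(ID),
    MajorID INTEGER
);
""")
conn.executemany(
    "INSERT INTO supervisor_table VALUES (?, ?)",
    [(0, "Prof. Tim Baldwin"), (1, "Prof. Trevor Cohn")],
)
conn.executemany(
    "INSERT INTO grade_table VALUES (?, ?, ?, ?)",
    [(20201001, "H1", 0, 0), (20193032, "H2A", 1, 0), (20195309, "H2B", 0, 1)],
)

# The query from the slide; ORDER BY is added only to make the output order fixed.
rows = conn.execute("""
    SELECT StudentID, Grade
    FROM grade_table, supervisor_table
    WHERE grade_table.SupervisorID = supervisor_table.ID
      AND SupervisorName = 'Prof. Tim Baldwin'
    ORDER BY StudentID
""").fetchall()
print(rows)
conn.close()
```

The query returns the two students supervised by Prof. Tim Baldwin, without the supervisor name ever being stored in grade_table.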

Database Systems – (INFO20003)
• INFO20003 covers related topics including
  • SQL
  • Specification of integrity constraints
  • Data modelling and relational database management systems
  • Transactions and concurrency control
  • Storage management
  • Web-based databases
• Highly relevant to data wrangling!
• Useful to do INFO20003 as part of a data science specialisation

Relational Algebra
https://medium.com/swlh/merging-dataframes-with-pandas-pd-merge-7764c7e2d46d

Joins in Python Pandas
import pandas as pd
pd.merge()
• on
• left_on, right_on
• how
  • inner
  • outer
  • left
  • right
[Figure: inner JOIN, left JOIN, right JOIN, outer JOIN. Image from https://stackoverflow.com/questions/53645882/]
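A small sketch of how the four values of how differ, using the grade and supervisor tables from earlier. To make the joins give different results, one student row (20209999, SupervisorID 3) and one supervisor row (ID 2) are added; both are hypothetical and not part of the lecture data.

```python
import pandas as pd

grade_table = pd.DataFrame({
    "StudentID": [20201001, 20193032, 20195309, 20209999],  # last row is hypothetical
    "Grade": ["H1", "H2A", "H2B", "H3"],
    "SupervisorID": [0, 1, 0, 3],  # 3 has no match in supervisor_table
})
supervisor_table = pd.DataFrame({
    "ID": [0, 1, 2],  # 2 is hypothetical and supervises no student
    "SupervisorName": ["Prof. Tim Baldwin", "Prof. Trevor Cohn", "Hypothetical Supervisor"],
})

# The key columns have different names, so left_on/right_on are used instead of on.
joins = {
    how: pd.merge(
        grade_table,
        supervisor_table,
        left_on="SupervisorID",
        right_on="ID",
        how=how,
    )
    for how in ("inner", "left", "right", "outer")
}

for how, merged in joins.items():
    print(how, len(merged))
```

The inner join keeps only matched rows, left and right keep every row of one side, and outer keeps every row of both, filling the gaps with NaN.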

Resources
• https://jakevdp.github.io/PythonDataScienceHandbook/03.07-merge-and-join.html
• https://jakevdp.github.io/PythonDataScienceHandbook/03.06-concat-and-append.html
• https://pandas.pydata.org/pandas-docs/version/0.22.0/merging.html

Challenges
• Once data is in a relational database, it is easier to wrangle.
• But it may be difficult to load it there in the first place …

More structured data – Spreadsheets
• Huge amounts of data are in spreadsheets
  • Businesses
  • Hospitals
  • …
• Microsoft (Excel), OpenOffice (Calc), Google Sheets

Break

Semi-structured Data

CSV: comma separated values
• Tabular information, with extension .csv
• Structured, but not like Excel or a relational DB
• Just a delimited text file, human readable.
• Lacks formatting information
• Does not contain formulas or macros for data verification or transformation
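As a minimal sketch of what "just a delimited text file" means in practice, the flat file from the relational database example is written here as CSV text and read with Python's csv module. The CSV text itself is an assumption made for illustration.

```python
import csv
import io

# The flat file from earlier, as it might look in a .csv file.
csv_text = """StudentID,Grade,Supervisor,Major
20201001,H1,Prof. Tim Baldwin,Linguistics
20193032,H2A,Prof. Trevor Cohn,Linguistics
20195309,H2B,Prof. Tim Baldwin,Commerce
"""

# DictReader uses the first line as column names.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# CSV carries no type information: every value arrives as a string
# and has to be converted explicitly.
student_ids = [int(row["StudentID"]) for row in rows]

print(rows[0])
print(student_ids)
```

Because the file has no types or formatting, decisions such as "is StudentID a number or a label?" are left to the program that reads it.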

HTML – Hypertext Markup language
• Marked up with elements that correspond to logical units,
  • e.g. a heading, paragraph or itemised list.
• Defines how a web browser will format and display the content
• Elements are marked by tags.
  • Tags: keywords contained in pairs of angle brackets, not case sensitive
  • Closed tags: <tag>content</tag>
  • Unclosed tag: e.g. <br>
• Elements can have attributes; ordering of attributes is not significant.
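These points can be observed with Python's built-in html.parser. The sketch below uses a made-up HTML fragment and a hypothetical TagCollector class; it shows that tag names are treated case-insensitively, that an unclosed <br> is accepted, and that attributes arrive as name/value pairs.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects start tags (with attributes) and non-empty text (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.text = []

    def handle_starttag(self, tag, attrs):
        # html.parser reports tag names in lower case.
        self.tags.append((tag, dict(attrs)))

    def handle_data(self, data):
        if data.strip():
            self.text.append(data.strip())


# Made-up fragment: upper-case <H1>, an unclosed <br>, two attributes on <p>.
html_doc = (
    '<H1>Data Formats</H1>'
    '<p class="intro" id="p1">Hello<br>world</p>'
    '<ul><li>XML</li><li>JSON</li></ul>'
)

parser = TagCollector()
parser.feed(html_doc)
parser.close()

print(parser.tags)
print(parser.text)
```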

HTML Example
Try it yourself: https://www.w3schools.com/html/tryit.asp?filename=tryhtml5_browsers_myhero
HTML examples: https://www.w3schools.com/html/html_lists.asp

Limitations of HTML
• HTML was designed for pure presentation
  • HTML is concerned with formatting, not meaning
  • it doesn’t matter what the content is about, HTML will format it
• HTML is not extensible
  • can’t be modified to meet specific domain knowledge
  • browsers have developed their own tags
• HTML can be inconsistently applied
  • almost everything is rendered somehow

XML: eXtensible Markup Language
• Extensible: user defined tags
• Facilitates better encoding of semantics

Subject guide in plain-text
Year of offer = 2019
Subject level = Undergraduate level 2
Subject code = COMP20008
Campus = Parkville
Availability = Semester 1, Semester 2

Subject guide in HTML – cont.

Subject guide in xml
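A minimal sketch of how the plain-text subject guide could be encoded as XML with Python's standard library. The element and attribute names (subject, code, year, level, campus, availability, semester) are assumptions chosen for illustration, not necessarily the names used in the lecture.

```python
import xml.etree.ElementTree as ET

# Root element with the subject code as an attribute (names are illustrative).
subject = ET.Element("subject", attrib={"code": "COMP20008"})
ET.SubElement(subject, "year").text = "2019"
ET.SubElement(subject, "level").text = "Undergraduate level 2"
ET.SubElement(subject, "campus").text = "Parkville"

# A repeated value becomes repeated child elements.
availability = ET.SubElement(subject, "availability")
for semester in ("Semester 1", "Semester 2"):
    ET.SubElement(availability, "semester").text = semester

xml_text = ET.tostring(subject, encoding="unicode")
print(xml_text)

# Reading it back: the tags say what each value means.
parsed = ET.fromstring(xml_text)
print(parsed.get("code"), parsed.findtext("campus"))
```

Unlike the HTML version, the tags here describe what the data is rather than how it should be displayed.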

XML syntax – well formed
• Begins with a declaration, the XML prolog: <?xml version="1.0" encoding="UTF-8"?>
• Elements
  • One root element
  • Properly nested
  • Attribute values must be quoted
  • Must have a closing tag, e.g. <campus>Parkville</campus>,
    or be self-closing, e.g. <campus name="Parkville"/> (self-closing tag with an attribute)
  • Case sensitive
• Comments: <!-- This is a comment -->
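The rules above can be checked mechanically, because an XML parser rejects any document that is not well formed. The sketch below uses xml.etree.ElementTree; the helper is_well_formed and the example strings are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False otherwise (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


examples = {
    "well formed": '<subject code="COMP20008"><campus>Parkville</campus></subject>',
    "badly nested": "<subject><campus>Parkville</subject></campus>",
    "unquoted attribute": "<subject code=COMP20008></subject>",
    "case mismatch": "<Campus>Parkville</campus>",
    "two root elements": "<campus>Parkville</campus><campus>Parkville</campus>",
    "missing closing tag": "<subject><campus>Parkville</subject>",
}

results = {name: is_well_formed(text) for name, text in examples.items()}
for name, ok in results.items():
    print(f"{name}: {ok}")
```

This is stricter than HTML, where a browser would try to render all of these somehow.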

XML syntax – cont.
• Preserves white space, e.g. the spacing in: I think …   therefore I am
• Some characters have special meaning
  • ‘<’ and ‘&’ are strictly illegal inside an element
  • "all books & videos are now < AUD 10" must be written as
    all books &amp; videos are now &lt; AUD 10
• A CDATA (character data) section may be used inside an XML element to include large blocks of text that contain special characters such as &, < and >
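A short sketch of both options, entity escaping and a CDATA section, using Python's standard library. The element names offer and quote are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

raw = "all books & videos are now < AUD 10"

# Option 1: replace the special characters by entities.
escaped = escape(raw)
entity_version = ET.fromstring(f"<offer>{escaped}</offer>")

# Option 2: wrap the raw text in a CDATA section.
cdata_version = ET.fromstring(f"<offer><![CDATA[{raw}]]></offer>")

# Putting the raw text straight into an element is not well formed.
try:
    ET.fromstring(f"<offer>{raw}</offer>")
    raw_is_well_formed = True
except ET.ParseError:
    raw_is_well_formed = False

# White space inside an element is preserved by the parser.
spaced = ET.fromstring("<quote>I think ...   therefore I am</quote>")

print(escaped)
print(entity_version.text)
print(cdata_version.text)
print(raw_is_well_formed)
print(repr(spaced.text))
```

Either way, the application reading the element gets the original text back; the escaping is only part of the file format.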

XML applications
• A ‘meta’ mark-up language.
• Mathematical Markup Language (MathML)
• Chemical Markup Language (CML)
• FHIR (Health/Medical data: http://hl7.org/fhir)
• RSS, SOAP, SVG, …

MathML example: markup an equation
In MathML, x^3 + 6x + 6 is represented as
<mrow>
  <msup><mi>x</mi><mn>3</mn></msup>
  <mo>+</mo>
  <mrow>
    <mn>6</mn><mo>&InvisibleTimes;</mo><mi>x</mi>
  </mrow>
  <mo>+</mo>
  <mn>6</mn>
</mrow>

XML vs HTML
• Extensible vs. non-extensible
• Case sensitive vs. not case sensitive
• Focus on semantics vs. focus on display

Break

Semi-structured Data – JSON
JSON

JavaScript Object Notation (JSON)
• JSON (www.json.org)
• Douglas Crockford (pretty much alone)
  • cf. the development of XML by committee
• “JavaScript: The Good Parts”
  • O’Reilly, Yahoo Press

JSON syntax rules
• Object data is in name/value pairs: "firstName": "John"
• JSON values
  • A number (integer or floating point)
  • A string (in double quotes)
  • A Boolean (true or false)
  • An array (in square brackets)
  • An object (in curly braces)
  • null

JSON syntax rules
• JSON Objects
  {"firstName": "John", "lastName": "Doe"}
• JSON Arrays
  "employees": [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna", "lastName": "Smith"},
    {"firstName": "Peter", "lastName": "Jones"}
  ]
• These objects repeat recursively down a hierarchy as needed.
• In terms of syntax that’s pretty much it!
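The syntax rules map directly onto Python types when the text is parsed with the built-in json module. In the sketch below the employees array is the one from the slide; the count, active and manager fields are hypothetical additions that show a number, a Boolean and null.

```python
import json

json_text = """
{
  "employees": [
    {"firstName": "John", "lastName": "Doe"},
    {"firstName": "Anna", "lastName": "Smith"},
    {"firstName": "Peter", "lastName": "Jones"}
  ],
  "count": 3,
  "active": true,
  "manager": null
}
"""

# Objects become dicts, arrays become lists, true/false/null become True/False/None.
data = json.loads(json_text)

print(type(data), type(data["employees"]))
print(data["employees"][1]["firstName"])
print(data["count"], data["active"], data["manager"])

# Serialising and parsing again gives back an equal structure.
round_trip = json.loads(json.dumps(data))
```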

JSON format (from json.org)

Verbosity
https://www.w3schools.com/js/js_json_xml.asp

Python libraries for JSON and XML
• json
• lxml
XML, JSON, CSV and HTML conversion tools
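A rough sketch of an XML to JSON conversion. It assumes the employee data is available as the XML shown in the code, and it uses the standard library's xml.etree.ElementTree in place of lxml so that it runs without extra packages; this is a swapped-in parser, not the library named on the slide.

```python
import json
import xml.etree.ElementTree as ET

# Assumed XML encoding of the employees example.
xml_text = (
    "<employees>"
    "<employee><firstName>John</firstName><lastName>Doe</lastName></employee>"
    "<employee><firstName>Anna</firstName><lastName>Smith</lastName></employee>"
    "<employee><firstName>Peter</firstName><lastName>Jones</lastName></employee>"
    "</employees>"
)

root = ET.fromstring(xml_text)

# Each <employee> element becomes a dict mapping child tag -> child text.
employees = [
    {child.tag: child.text for child in employee}
    for employee in root.findall("employee")
]

json_text = json.dumps({"employees": employees})
print(json_text)

# The same data is shorter in JSON because names are not repeated in closing tags.
print(len(xml_text), len(json_text))
```

This flat mapping only works because every employee has simple text children; attributes, mixed content or repeated child tags would need a more careful conversion.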

JSON compared to XML
• JSON is simpler and more compact/lightweight than XML; easy to parse.
  • This appeals to programmers looking for speed and efficiency
• Widely used for storing data in NoSQL databases
• Common JSON application: read and display data from a web server using JavaScript. https://www.w3schools.com/js/js_json.asp
• XML comes with a large family of other standards for querying and transforming (XQuery, XML Schema, XPath, XSLT, namespaces, …)
  • allows formal validation
  • makes you consider the data design more closely

JSON: Summary
• JavaScript Object Notation
• Lightweight, streamlined, standard method of data exchange
• Originally designed to speed up client/server interactions:
  • By running in the client browser
• Can be used to represent any kind of semi-structured data
• Lacks context and schema definitions

Break

Semi-structured Data – Cont.
PDF

PDF – Portable Document Format
• A format for presenting documents independently of application software / OS
• Introduced by Adobe in 1993, standardized in 2008
• Large amounts of useful data stored in PDF format
(e.g. forms, invoices, etc.). Legacy data in particular.

Challenges of PDF format
• Format is very flexible, limited consistency
• May contain different types of data; similar looking documents can be
represented very differently
• Text may be stored as images, e.g., for scanned documents
• Data is unstructured
• As a result, automated extraction is difficult
• May require substantial preprocessing to extract usable data

Approaches to PDF data
Convert the PDF to text using a library (e.g. poppler, PDFMiner):
pdftotext nvsr65_05.pdf nvsr65_05.txt
National Vital Statistics Reports Volume 65, Number 5
June 30, 2016
Deaths: Leading Causes for 2014 by Melonie Heron, Ph.