Semester 2 2021
Lecture 3, Part II: Data Formats – Semi-structured formats: CSV, HTML, and XML
Semi-structured Data
CSV: comma-separated values
• Tabular information, with extension .csv
• Structured, but not like Excel or a relational DB
• Just a delimited text file, human readable.
• Lacks formatting information
• Does not contain formulas or macros for data verification or transformation
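The points above can be seen by reading a small CSV file with Python's standard csv module. This is only a sketch: the column names and records are made up for illustration, and the text is held in a string rather than a real .csv file.

```python
import csv
import io

# Hypothetical sample data; in practice this would be the contents of a .csv file.
raw = "name,unit,mark\nAlice,COMP1000,78\nBob,COMP1000,64\n"

# DictReader treats the first row as column names and yields one dict per record.
rows = list(csv.DictReader(io.StringIO(raw)))

for row in rows:
    # Every value arrives as a plain string: CSV carries no type or
    # formatting information, so the conversion to int is up to the reader.
    print(row["name"], int(row["mark"]))
```

Note that the mark has to be converted explicitly, since the file itself says nothing about data types.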
HTML – HyperText Markup Language
• Marked up with elements that correspond to logical units, e.g. a heading, paragraph or itemised list.
• Defines how a web browser will format and display the content
• Elements are marked by tags.
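• Tags: keywords contained in pairs of angle brackets, not case sensitive
• Closed tags: an opening tag with a matching closing tag, e.g. <p> … </p>
• Unclosed tag: a tag with no closing counterpart, e.g. <br>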
• Elements can have attributes; ordering of attributes is not significant.
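As a rough illustration of tags and attributes, the sketch below feeds a small fragment to Python's standard html.parser. The fragment and the class name TagLogger are invented for this example; the parser reports tag and attribute names in lower case, which reflects the fact that HTML tags are not case sensitive.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Records each start tag together with its attributes (illustrative helper)."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a dict ignores their order.
        self.seen.append((tag, dict(attrs)))


logger = TagLogger()
# <P> is a closed tag (it has </P>); <BR> is an unclosed tag.
logger.feed('<P CLASS="intro" id="first">Hello<BR>world</P>')
print(logger.seen)
```

Storing the attributes in a dict mirrors the last bullet: swapping the order of CLASS and id in the fragment would give the same result.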
HTML Example
Try it yourself: https://www.w3schools.com/html/tryit.asp?filename=tryhtml5_browsers_myhero
HTML examples: https://www.w3schools.com/html/html_lists.asp
Limitations of HTML
• HTML was designed for pure presentation
• HTML is concerned with formatting, not meaning
• it doesn’t matter what the content is about, HTML will format it
• HTML is not extensible
• can’t be modified to meet specific domain knowledge
• browsers have developed their own tags (
• HTML can be inconsistently applied: almost everything is rendered somehow, e.g. is this acceptable?
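The tolerance described in the last point can be sketched with Python's standard html.parser, which, much like a browser, accepts badly nested and unclosed tags without complaint. The sloppy fragment and the EventLogger class are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Records start and end tags in the order they are encountered."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


# Hypothetical input: <b> and <i> are closed in the wrong order,
# and <p> is never closed at all.
sloppy = "<b><i>bold italic</b></i><p>never closed"

logger = EventLogger()
logger.feed(sloppy)
logger.close()
print(logger.events)  # no exception is raised for the malformed input
```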
XML: eXtensible Markup Language
• Extensible: user defined tags
• Facilitates better encoding of semantics
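A minimal sketch of user-defined tags, using Python's standard xml.etree.ElementTree. The vocabulary (course, title, lecturer) and the values are hypothetical; the point is only that the tag names describe what the data means rather than how it should look.

```python
import xml.etree.ElementTree as ET

# Hypothetical document with tags invented for a "course" domain.
doc = """<course code="COMP1000">
  <title>Data Formats</title>
  <lecturer>A. Example</lecturer>
</course>"""

root = ET.fromstring(doc)

# The tag names carry the semantics, so data can be retrieved by meaning.
print(root.tag, root.attrib["code"])
print(root.find("title").text)
print(root.find("lecturer").text)
```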
XML syntax – cont.
• Preserves white spaces.
• ‘<’ and ‘&’ are strictly illegal inside element content; they must be written as &lt; and &amp;
• A CDATA (character data) section, written <![CDATA[ ... ]]>, may be used inside an XML element to include large blocks of text that contain special characters such as < and &
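The two ways of handling special characters can be compared in a short sketch with Python's standard library. The sample text and the code element name are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical text containing both problem characters.
text = "if a < b && b < c"

# Option 1: escape the special characters as entity references.
escaped_doc = "<code>" + escape(text) + "</code>"

# Option 2: wrap the text, unchanged, in a CDATA section.
cdata_doc = "<code><![CDATA[" + text + "]]></code>"

# Placing the raw text directly inside the element is not well-formed XML.
try:
    ET.fromstring("<code>" + text + "</code>")
    raw_ok = True
except ET.ParseError:
    raw_ok = False

# Both safe forms parse back to exactly the original text.
from_escaped = ET.fromstring(escaped_doc).text
from_cdata = ET.fromstring(cdata_doc).text
print(raw_ok, from_escaped == text, from_cdata == text)
```

Escaping suits short values; CDATA is more convenient for large blocks where escaping every character would hurt readability.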
XML applications
• XML is a ‘meta’ markup language: it is used to define other markup languages
• Mathematical Markup Language (MathML)
• CML (Chemical Markup Language)
• FHIR (Health/Medical data: http://hl7.org/fhir)
• RSS, SOAP, SVG, …
MathML example: markup an equation
In MathML, x^3 + 6x + 6 is represented as
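One possible encoding of this expression can be generated with Python's standard xml.etree.ElementTree. This is a sketch in presentation MathML and not necessarily the markup shown on the original slide; it assumes the expression is x cubed plus 6x plus 6, and the leaf helper is invented for this example.

```python
import xml.etree.ElementTree as ET


def leaf(parent, tag, text):
    """Append a child element holding only text (illustrative helper)."""
    node = ET.SubElement(parent, tag)
    node.text = text
    return node


math = ET.Element("math", xmlns="http://www.w3.org/1998/Math/MathML")
row = ET.SubElement(math, "mrow")

# x^3: <msup> takes a base and a superscript.
power = ET.SubElement(row, "msup")
leaf(power, "mi", "x")   # mi = identifier
leaf(power, "mn", "3")   # mn = number

leaf(row, "mo", "+")     # mo = operator

# 6x: a number followed by an identifier reads as multiplication.
leaf(row, "mn", "6")
leaf(row, "mi", "x")

leaf(row, "mo", "+")
leaf(row, "mn", "6")

markup = ET.tostring(math, encoding="unicode")
print(markup)
```

Each part of the equation gets its own element, so software can tell an identifier from a number or an operator instead of seeing a flat string.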
XML vs HTML
• XML is extensible; HTML is non-extensible
• XML is case sensitive; HTML is not case sensitive
• XML focuses on semantics; HTML focuses on display
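The case-sensitivity difference can be sketched by giving the same fragment to an HTML parser and an XML parser from Python's standard library. The fragment and the TagNames class are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical fragment whose opening and closing tags differ only in case.
snippet = "<Note>remember</NOTE>"


class TagNames(HTMLParser):
    """Collects tag names as the HTML parser reports them."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

    def handle_endtag(self, tag):
        self.tags.append("/" + tag)


# HTML view: both tags are normalised to lower case, so they match.
html_view = TagNames()
html_view.feed(snippet)
html_view.close()

# XML view: <Note> and </NOTE> are different names, so parsing fails.
try:
    ET.fromstring(snippet)
    xml_accepts = True
except ET.ParseError:
    xml_accepts = False

print(html_view.tags, xml_accepts)
```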