XML & XPath
DSCI 551
Wensheng Wu
Agenda
• XML:
– What is it and why do we care?
– Data model (ordered tree)
– Query language: XPath
XML
• eXtensible Markup Language
• XML 1.0 – a W3C recommendation (first edition 1998; fifth edition 2008)
• Root: SGML (Standard Generalized Markup Language)
• After the root: a format for sharing data
• Ajax (the “x” stands for XML)
• jQuery ($.ajax({…, dataType: 'xml' / 'json'}))
SGML
• Derived from IBM’s GML (Generalized Markup Language), developed in the 1960s
– Charles Goldfarb, Edward Mosher, and
Raymond Lorie
– For sharing of large-project documents
• Basis for HTML & XML
– XML is roughly an augmented subset (adds
more restrictions)
– HTML is an application of SGML
Why XML is of Interest to Us
• XML is a syntax (serialization format) for
data
• This is exciting because:
– Can translate any data to XML
– Can ship XML over the Web (HTTP)
– Can input XML into any application
– Thus: data sharing and exchange on the Web
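As a rough illustration of "can translate any data to XML", the sketch below serializes a few records with Python's standard xml.etree.ElementTree module. The rows, the column names, and the helper rows_to_xml are assumptions made for this example, not part of the slides.

```python
import xml.etree.ElementTree as ET

# Hypothetical relational-style rows; the column names are illustrative only.
rows = [
    {"title": "Foundations of Databases", "publisher": "Addison Wesley", "year": 1995},
    {"title": "Data on the Web", "publisher": "Morgan Kaufmann", "year": 1999},
]

def rows_to_xml(rows, root_tag="bib", row_tag="book"):
    """Serialize a list of dicts as one root element with one child per row."""
    root = ET.Element(root_tag)
    for row in rows:
        item = ET.SubElement(root, row_tag)
        for column, value in row.items():
            # Each column becomes a child element holding the value as text.
            ET.SubElement(item, column).text = str(value)
    return ET.tostring(root, encoding="unicode")

xml_text = rows_to_xml(rows)
print(xml_text)
```

The resulting string can be sent over HTTP and parsed back by any XML-aware application, which is the exchange scenario the slide describes.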
XML Data Sharing and Exchange
[Figure: applications and data sources (relational data, object-relational data, legacy data) exchange XML data over the Web (HTTP); the specific data management tasks shown are transform, integrate, and warehouse.]
From HTML to XML
HTML describes the presentation
HTML
Bibliography
Foundations of Databases
Abiteboul, Hull, Vianu
Addison Wesley, 1995
Data on the Web
Abiteboul, Buneman, Suciu
Morgan Kaufmann, 1999
XML
…
XML describes the content
Web Services
• A software system designed to support
interoperable machine-to-machine
interaction over a network (from Wikipedia)
• Use HTTP for machine-to-machine exchange of files
– E.g., in XML & JSON formats
Ajax
• Asynchronous JavaScript and XML
• Web clients send and receive data from
server asynchronously
– Benefit: more responsive web pages
• Common to use XML, JSON as data format
Ajax in action (link)
https://www.amazon.com/s?k=data+mining&ref=nb_sb_noss_2
XML Terminology
• tags: book, title, author, …
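• start tag: <book>; end tag: </book>
• elements: <book>…</book>, <author>…</author>
• elements may be nested: <book> <title>…</title> </book>
• empty element (no content): <title></title>, abbreviated <title/>
– Note that an empty element can have attributes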
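The terms above can be seen in a parsed document. This is a minimal sketch using Python's standard xml.etree.ElementTree; the fragment, the price attribute, and the empty cd element with its id attribute are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: <cd id="c1"/> is an empty element that still has an attribute.
doc = '<bib><book price="35"><title>Data on the Web</title></book><cd id="c1"/></bib>'

root = ET.fromstring(doc)
book = root.find("book")
cd = root.find("cd")

print(root.tag)                        # tag name of the outermost element
print([child.tag for child in root])   # elements nested directly inside it
print(book.find("title").text)         # character content of the title element
print(cd.text, len(cd), cd.attrib)     # empty element: no text, no children, one attribute
```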
XML schema
Example XML for Company
DTD
…
Example of valid XML document:
DTD: The Content Model
• Content model:
– Complex = a regular expression over other elements
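– Text-only = (#PCDATA)
– Empty = EMPTY
– Any = ANY
– Mixed content = (#PCDATA | A | B | C)*
• PCDATA vs. CDATA
– PCDATA (parsed character data): the parser scans the text, so tags inside it are treated as markup
– CDATA (character data): the parser does not scan the text for markup; in a DTD, CDATA is an attribute type rather than an element content model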
DTD: Regular Expressions
Processing instructions
• <?xml version="1.0" encoding="UTF-8"?>
• This is the first line of an XML document
– Declaring that the following is an XML doc…
– that follows standard version 1.0
– and whose encoding is UTF-8
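A small sketch of how that first line gets produced when writing XML from Python. It uses the standard xml.etree.ElementTree module (the xml_declaration argument of tostring requires Python 3.8 or later); the element names are placeholders.

```python
import xml.etree.ElementTree as ET

root = ET.Element("bib")
ET.SubElement(root, "book").text = "Data on the Web"

# xml_declaration=True asks ElementTree to emit the declaration line first.
data = ET.tostring(root, encoding="UTF-8", xml_declaration=True)
first_line = data.decode("utf-8").splitlines()[0]
print(first_line)

# A parser accepts the declaration and returns the same tree.
reparsed = ET.fromstring(data)
print(reparsed.tag, reparsed.find("book").text)
```

ElementTree writes the declaration with single quotes, which is equivalent to the double-quoted form shown on the slide.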
Agenda
• XML:
– What is it and why do we care?
– Data model
– Query language: XPath
Querying XML Data
• XPath = simple navigation through the tree
• XQuery = the SQL of XML
…
…
Data Model for XPath
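[Figure: tree for the sample document. The document node has the root element bib as its child; bib has book children; a book has children such as publisher and author, with text values such as Addison-Wesley and Serge Abiteboul.]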
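The ordered-tree view can be made concrete by walking a parsed document. The sketch below uses Python's standard xml.etree.ElementTree; the sample document is a reduced, assumed version of the bib example, and walk is a helper written for this illustration.

```python
import xml.etree.ElementTree as ET

# Assumed sample document, cut down to the nodes named in the figure.
DOC = """<bib>
  <book>
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
  </book>
  <book>
    <author>Jeffrey D. Ullman</author>
  </book>
</bib>"""

def walk(element, depth=0):
    """Yield (depth, tag, text) for an element and its descendants in document order."""
    text = (element.text or "").strip()
    yield depth, element.tag, text
    for child in element:  # children are stored in document order
        yield from walk(child, depth + 1)

root = ET.fromstring(DOC)  # the root element; fromstring returns no separate document node
for depth, tag, text in walk(root):
    print("  " * depth + tag + (": " + text if text else ""))
```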
XPath: Simple Expressions
/bib/book/year
Result: the year elements that are children of book elements under bib
/bib/paper/year
Result: empty (there were no papers)
//: finding descendants
//author
Result: all author elements anywhere in the document
/bib//first-name
Result: all first-name elements that are descendants of bib
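The child and descendant steps above can be tried with Python's standard xml.etree.ElementTree, which supports a limited XPath subset. Two assumptions apply: the sample document is a stand-in for the deck's bib example (its values are illustrative), and paths are written relative to the root element, so ./book/year plays the role of /bib/book/year.

```python
import xml.etree.ElementTree as ET

# Assumed stand-in for the deck's bib document.
DOC = """<bib>
  <book price="35">
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
    <year>1995</year>
  </book>
  <book price="55">
    <author>Jeffrey D. Ullman</author>
    <year>1998</year>
  </book>
</bib>"""

bib = ET.fromstring(DOC)  # the root element <bib>

years = [y.text for y in bib.findall("./book/year")]          # /bib/book/year
papers = bib.findall("./paper/year")                          # /bib/paper/year
authors = bib.findall(".//author")                            # //author
first_names = [e.text for e in bib.findall(".//first-name")]  # /bib//first-name

print(years, papers, len(authors), first_names)
```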
Select Child by Index
• Index of children starts from 1
• //author[1]
• /bib/book[2]/author
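A short sketch of positional selection, again with the standard xml.etree.ElementTree and an assumed sample document. Note that //author[1] selects every author that is the first author child of its own parent, so it can return one author per book.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; paths are relative to the root element <bib>.
DOC = """<bib>
  <book>
    <author>Serge Abiteboul</author>
    <author>Victor Vianu</author>
  </book>
  <book>
    <author>Jeffrey D. Ullman</author>
  </book>
</bib>"""

bib = ET.fromstring(DOC)

first_authors = [a.text for a in bib.findall(".//author[1]")]            # //author[1]
second_book_authors = [a.text for a in bib.findall("./book[2]/author")]  # /bib/book[2]/author

print(first_authors)
print(second_book_authors)
```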
XPath: Text Nodes
/bib/book/author/text()
Result: Serge Abiteboul
Victor Vianu
Jeffrey D. Ullman
Rick Hull doesn’t appear because his author element contains firstname and lastname elements rather than a text node
Functions in XPath:
– text() = matches text nodes
– * = matches only element nodes
– node() = matches any node (element or text)
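Python's standard xml.etree.ElementTree has no text() step, because it stores text on the element itself rather than as a separate node. The sketch below therefore emulates /bib/book/author/text() by reading each author's leading text, and contrasts it with *, which only ever returns elements. The sample document is an assumption modeled on the result listed above.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: Rick Hull's author element has child elements, not text.
DOC = """<bib>
  <book>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
    <author>Victor Vianu</author>
  </book>
  <book>
    <author>Jeffrey D. Ullman</author>
  </book>
</bib>"""

bib = ET.fromstring(DOC)
authors = bib.findall("./book/author")  # /bib/book/author

# Emulates text(): keep the non-blank leading text of each author element.
texts = [a.text.strip() for a in authors if a.text and a.text.strip()]

# .//author/* corresponds to //author/*: element children only, never text.
children = [e.tag for e in bib.findall(".//author/*")]

print(texts)
print(children)
```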
XPath: Wildcard
//author/*
Result: every element that is a child of an author element
* matches any element
XPath: Attribute Nodes
/bib/book/@price
Result: [’35’, ’55’]
@price means that price has to be an attribute
Is it the same as /bib/book[@price]?
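One way to explore the two expressions is sketched below with the standard xml.etree.ElementTree. This module can filter on an attribute ([@price]) but cannot return attribute nodes, so the values are read with get(). The document, including the third book without a price, is a made-up example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the third book has no price attribute.
DOC = """<bib>
  <book price="35"><title>Foundations of Databases</title></book>
  <book price="55"><title>Data on the Web</title></book>
  <book><title>Untitled draft</title></book>
</bib>"""

bib = ET.fromstring(DOC)

# /bib/book[@price] selects book ELEMENTS that carry a price attribute.
books_with_price = bib.findall("./book[@price]")

# /bib/book/@price selects the attribute NODES; here we read their values instead.
prices = [b.get("price") for b in books_with_price]

print([b.tag for b in books_with_price])
print(prices)
```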
XPath: Attribute Nodes
• /bib/book/@*
– Return all attribute nodes of book elements
• Result:
– [’35’, ’55’]
XPath: Predicates
/bib/book/author[first-name]
Return author elements (under /bib/book) which have a child element called “first-name”
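The existence predicate can be sketched with the standard xml.etree.ElementTree, which supports the [tag] form. The sample document is assumed, and the path is written relative to the root element <bib>.

```python
import xml.etree.ElementTree as ET

# Assumed sample document: only one author has a first-name child.
DOC = """<bib>
  <book>
    <author>Serge Abiteboul</author>
    <author><first-name>Rick</first-name><last-name>Hull</last-name></author>
  </book>
  <book>
    <author>Jeffrey D. Ullman</author>
  </book>
</bib>"""

bib = ET.fromstring(DOC)

# /bib/book/author[first-name]: authors that have a first-name child element.
with_first_name = bib.findall("./book/author[first-name]")

print([a.findtext("last-name") for a in with_first_name])
```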
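XPath: More Predicates
/bib/book/author[firstname][address[//zip][city]]/lastname
Return lastname of author elements which have child element firstname and child element “address” which itself has …
Note: inside a predicate, //zip is evaluated from the document root; write .//zip to test for a zip descendant of address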
XPath: More Predicates
/bib/book[@price < 60]
/bib/book[author/@age < 25]
/bib/book[author/text()]
Return books under bib that have an author element with a text node
XPath: More Predicates
/bib/book[contains(author, 'Ullman')]
Return books under bib whose (first) author subelement contains the word 'Ullman' in its text node (note that contains is case-sensitive)
What about //book/author[contains(., "Ullman")] ?
XPath: More Predicates
• /bib/book[author = "Victor Vianu"]
• /bib/book[author/text() = "Victor Vianu"]
• /bib/book/author[. = 'Victor Vianu']
XPath: More Predicates
• /bib/book[price > 30 or year > 1995]
• /bib/book[price > 30 and year >= 1995]
• /bib/book[not(price > 30)]
• Note: and, or, not must be written in lowercase
• Parentheses are required for not
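The standard xml.etree.ElementTree does not implement comparisons, contains(), or the and/or/not operators, so the sketch below emulates several of the predicates above in plain Python and uses the one form the module does support ([tag='text']). The sample document is assumed, and price is an attribute here, so the last filter is only an analogue of the price > 30 and year >= 1995 example.

```python
import xml.etree.ElementTree as ET

# Assumed sample document; price is stored as an attribute.
DOC = """<bib>
  <book price="35">
    <publisher>Addison-Wesley</publisher>
    <author>Serge Abiteboul</author>
    <author>Victor Vianu</author>
    <year>1995</year>
  </book>
  <book price="55">
    <author>Jeffrey D. Ullman</author>
    <year>1998</year>
  </book>
</bib>"""

bib = ET.fromstring(DOC)
books = bib.findall("./book")  # /bib/book

def string_value(element):
    """XPath string value of an element: all descendant text concatenated."""
    return "".join(element.itertext())

def prices(selected):
    """Identify selected books by their price attribute."""
    return [b.get("price") for b in selected]

# /bib/book[@price < 60]
under_60 = [b for b in books if float(b.get("price")) < 60]

# /bib/book[contains(author, 'Ullman')]: XPath 1.0 tests only the first author.
ullman = [b for b in books
          if b.find("author") is not None
          and "Ullman" in string_value(b.find("author"))]

# /bib/book[author = "Victor Vianu"]: true if ANY author has that string value.
vianu = bib.findall("./book[author='Victor Vianu']")

# /bib/book[not(publisher)]
no_publisher = [b for b in books if b.find("publisher") is None]

# Analogue of /bib/book[price > 30 and year >= 1995], reading price from the attribute.
both = [b for b in books
        if float(b.get("price")) > 30 and int(b.findtext("year")) >= 1995]

print(prices(under_60), prices(ullman), prices(vianu), prices(no_publisher), prices(both))
```

The contrast between the contains() and = filters is the point of interest: the first looks only at the first author, while the second succeeds if any author matches.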
XPath: More Predicates
• /bib/book[not(publisher)]
• What about /bib/book[author[not(node())]]?
XPath: alternatives
/bib/book|/bib/cd
Return book and cd elements under /bib
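The union operator | is not available in the standard xml.etree.ElementTree, so this sketch emulates /bib/book|/bib/cd by filtering the children of the root element, which also keeps document order. The document and its contents are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical document mixing book, cd, and paper children.
DOC = "<bib><book>A</book><cd>B</cd><paper>C</paper><book>D</book></bib>"

bib = ET.fromstring(DOC)

# Children of <bib> whose tag is book or cd, in document order.
selected = [e for e in bib if e.tag in ("book", "cd")]

print([(e.tag, e.text) for e in selected])
```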
Questions
What do these return?
//*
//@*
Resources
• Comparison of SGML and XML
– https://www.w3.org/TR/NOTE-sgml-xml-971215/
• XML
– http://www.w3schools.com/xml/default.asp
• XPath
– http://www.w3schools.com/xml/xml_xpath.asp
Resources
• Testers
– https://codebeautify.org/Xpath-Tester (no support for alternation such as “/bib/(book|cd)”, but /bib/book|/bib/cd is ok)
– https://www.freeformatter.com/xpath-tester.html (no support for “contains”, but supports both forms of alternation above)
– http://www.xpathtester.com/xpath