9/1/2021
1
Introduction to Data Science
Lecture 4
Semi-Structured Data,
XML
CIS 5930/4930 – Fall 2021
Exam
• One Midterm exam
– Proctored, online (Canvas)
– November 3 during regular scheduled lecture time
• 3.05 – 4.20pm EST
– To be done individually
– Multiple choice test
– Closed book, notes
– A week before (October 25, 27, November 1st)
• during regular scheduled lecture time – 3.05 – 4.20pm EST
• course material review lectures
– No other exams
CIS 5930/4930 – Fall 2021
Exam
• Exam (online) – the student is solely
responsible for obtaining/installing all
necessary components for Honorlock to
function properly. This includes, but is not
limited to:
o A laptop or desktop computer
o A webcam
o A microphone
o A stable and strong Internet connection
CIS 5930/4930 – Fall 2021
SQL → Relational Algebra
CIS 5930/4930 – Fall 2021
Product (pid, name, price)
Purchase (pid, cid, store)
Customer (cid, name, city)
SELECT DISTINCT P.name, C.name
FROM Product AS P, Purchase AS Pu, Customer AS C
WHERE P.pid = Pu.pid AND C.cid = Pu.cid AND
P.price > 100 AND C.city = ‘Houston’
SQL → Relational Algebra
CIS 5930/4930 – Fall 2021
Product (pid, name, price)
Purchase (pid, cid, store)
Customer (cid, name, city)
δ
P
Product Purchase
cid = cid
Customer
s
P.name, C.name
price > 100 and city = “Houston”
pid = pid
Relational Query Optimization
CIS 5930/4930 – Fall 2021
δ
P
Product Purchase
cid = cid
Customer
P.name, C.name
pid = pid
s city = ‘Houston’
s Price > 100
Product (pid, name, price)
Purchase (pid, cid, store)
Customer (cid, name, city)
1 2
3 4
5 6
9/1/2021
2
Relational Model Benefits
CIS 5930/4930 – Fall 2021
• Physical Data Independence
• Oblivious to data organization on disk
• Logical Data Independence
• Schema can evolve without affecting
the users/applications
XML Data Model
bib
book book
publisher
author . . . .
Addison-Wesley Serge Abiteboul
The root
The root element
CIS 5930/4930 – Fall 2021
XML Sample
CIS 5930/4930 – Fall 2021
XPath: Simple Expressions
Result:
Result: empty (there were no papers)
/bib/book/year
/bib/paper/year
CIS 5930/4930 – Fall 2021
Result:
Result:
//author
/bib//first-name
XPath: Simple Expressions
CIS 5930/4930 – Fall 2021
XPath: Attribute Nodes
Result: “55, 75”
@price means that price is has to be an
attribute
/bib/book/@price
CIS 5930/4930 – Fall 2021
7 8
9 10
11 12
9/1/2021
3
XPath: Wildcard
Result:
* Matches any element
@* Matches any attribute
//author/*
CIS 5930/4930 – Fall 2021
XPath: Predicates
Result:
/bib/book/author[first-name=‘Rick’]
CIS 5930/4930 – Fall 2021
XPath: More Predicates
//product[@price < 60] /bib/book[author/@age < 25] CIS 5930/4930 - Fall 2021 Xpath: Summary bib matches a bib element * matches any element / matches the root element /bib matches a bib element under root bib/paper matches a paper in bib bib//paper matches a paper in bib, at any depth //paper matches a paper at any depth paper|book matches a paper or a book @price matches a price attribute bib/book/@price matches price attribute in book, in bib bib/book[@price<“55”]/author/lastname matches… CIS 5930/4930 - Fall 2021 XQuery • Based on Quilt, which is based on XML-QL • Uses XPath to express more complex queries CIS 5930/4930 - Fall 2021 FLWR (“Flower”) Expressions FOR ... LET... WHERE... RETURN... CIS 5930/4930 - Fall 2021 13 14 15 16 17 18 9/1/2021 4 FOR-WHERE-RETURN Find all book titles published after 1995: FOR $x IN document("bib.xml")/bib/book, $y IN $x/title WHERE $x/year/text() > 1995
RETURN $y
CIS 5930/4930 – Fall 2021
FOR-WHERE-RETURN
Equivalent
FOR $x IN document(“bib.xml”)/bib/book[year/text() > 1995] /title
RETURN $x
Even shorter:
document(“bib.xml”)/bib/book[year/text() > 1995]/title
CIS 5930/4930 – Fall 2021
FOR-WHERE-RETURN
• Find all book titles and the year when they
were published:
FOR $x IN document(“bib.xml”)/ bib/book
RETURN
CIS 5930/4930 – Fall 2021
Nesting
For each author of a book by Morgan
Kaufmann, list all books s/he published:
FOR $b IN document(“bib.xml”)/bib,
$a IN $b/book[publisher /text()=“Morgan Kaufmann”]/author
RETURN
{ $a,
FOR $t IN $b/book[author/text()=$a/text()]/title
RETURN $t
}
In the RETURN clause comma concatenates XML fragments
CIS 5930/4930 – Fall 2021
Result
CIS 5930/4930 – Fall 2021
Aggregates, Ordering
Find all books with more than 3 co-authors:
count = a function that counts
avg = computes the average
sum = computes the sum
distinct-values = eliminates duplicates
Order by – orders the result set in ascending or descending order
FOR $x IN document(“bib.xml”)/bib/book
WHERE count($x/author)>=3
ORDER BY $x/@price descending
RETURN $x
CIS 5930/4930 – Fall 2021
19 20
21 22
23 24
9/1/2021
5
Aggregates
Same thing:
FOR $x IN document(“bib.xml”)/bib/book[count(author)>3]
ORDER BY $x/@price descending
RETURN $x
CIS 5930/4930 – Fall 2021
Aggregates
Print all authors who published more than 3
books (no duplicates → distinct-values clause)
FOR $b IN document(“bib.xml”)/bib,
$a IN distinct-values($b/book/author/text())
WHERE count($b/book[author/text()=$a])>3
RETURN
CIS 5930/4930 – Fall 2021
Aggregates
Find books whose price is larger than average:
FOR $b in document(“bib.xml”)/bib,
$x in $b/book
LET $a:=avg($b/book/price/text())
WHERE $x/price/text() > $a
RETURN $x
CIS 5930/4930 – Fall 2021
Flattening
• “Flatten” the authors, i.e. return a list of
(author, title) pairs
FOR $b IN document(“bib.xml”)/bib/book,
$x IN $b/title/text(),
$y IN $b/author/text()
RETURN
Result:
CIS 5930/4930 – Fall 2021
Re-grouping
• For each author, return all titles of her/his
books
FOR $b IN document(“bib.xml”)/bib,
$x IN $b/book/author/text()
RETURN
{ FOR $y IN $b/book[author/text()=$x]/title
RETURN $y }
Result:
. . . .
CIS 5930/4930 – Fall 2021
Re-grouping
• Same, but eliminate duplicate authors:
FOR $b IN document(“bib.xml”)/bib
LET $a := distinct-values($b/book/author/text())
FOR $x IN $a
RETURN
{ FOR $y IN $b/book[author/text()=$x]/title
RETURN $y }
CIS 5930/4930 – Fall 2021
25 26
27 28
29 30
9/1/2021
6
XQuery: Standard Functions
CIS 5930/4930 – Fall 2021
• Divided into subsets by functionality
• Numeric (e.g. abs(), floor(), round())
• String (e.g. concat(), normalize-space())
• Duration, Date and Time (e.g. year-from-
date())
• Sequences (e.g. reverse())
• Accessors (e.g. data())
XQuery: User Defined Functions
CIS 5930/4930 – Fall 2021
• XQuery (depending on vendor) supports
many useful functions
• If you can not find a function, you can write
it yourself
• Reuse common expressions written in your
queries
• Make your queries easy to read
• Use recursion if necessary
XQuery: User Defined Functions
CIS 5930/4930 – Fall 2021
declare function
namespace:function_name(
$input1 as input_datatype1,
…)
as return_datatype
{
(: function body: for, let, where,
order by, return 🙂
};
31 32
33