COMP9321 Data Services Engineering
Term 1, 2021 Week 1: Data Access
Data Lakes and Data Pipelines
2
Data Services – what is it about?
Two sides of a coin:
Require specific access techniques
• Data integration/aggregation from multiple sources (data prep)
• Data publication for consumer access (API)
3
4
Challenges from implementation view point
Difficult to obtain a ”single view of X” for any X
• What data do I have about X?
• How do I stitch together the info I need (choose the right data model)?
• What else is X related to?
No uniformity (model or language)
• Data about X is stored in many different formats.
• Accessing or updating X involves many different APIs or access methods
• Manual coding of ”distributed query plans”
What’s data sources or existing APIs available and where? What protocol do they use?
What format are they in?
5
Obtaining Data
Useful data can be found in many places • on the Web, possibly via an API
• in documents in a file system
• in spreadsheets
• in videos
• etc. etc. etc.
and in a variety of formats • Unformatted text (in files)
• PDF documents (in files)
• HTML documents (web pages) • XML documents (via web APIs) • JSON data (often via web APIs) • CSV data files (spreadsheets)
6
Unformatted Text Data
Unformatted text is generally unfriendly
Hi James,
Here are some stats on in the first term of 2018 sales in Australia. Up by 20% on last year for our end point solutions
(backup solutions sales ranking highest).
And corporate solutions sales is up by 30% !!
Vulnerability scanning solutions ranking highest.
Best regards,
7
Unformatted Text Data (Cont’d)
The same information can be conveyed in many ways
Hey James,
FYI
Sales in Australia for the end point solutions increased by 20% where backup solutions being most popular.
And corporate solutions’ sales had a great jump by (40%) with an increase demand for vulnerability scanners
Regards, Hmmmmmmmmmm
8
Unformatted Text Data (Cont’d)
In order to mine useful data from text, sophisticated techniques are required like: • Natural language processing (NLP) (syntax)
• Machine Learning (ML) (patterns)
Such techniques are, at best, approximate
Specific problems …
• Named entity recognition (who/what is end point solutions?) • Training requirements (for ML)
9
PDF Documents
PDF documents have some structure
• which specifies content and layout commands
• including both text and (mainly) binary data (e.g. OCR) • but structure is not necessarily helpful for extraction
Python packages (e.g. pyPDF2) for dealing with PDF • aim to extract the text from the document
• but don’t aim to keep the layout structure
PDF Documents (Cont’d)
Why even extracting text from PDFs is hard …
• text chunks don’t necessarily correspond to words
10
11
PDF Documents (Cont’d)
More sophisticated extraction from PDF PDFminer (including pdf2txt.py and dumppdf.py)
• toolset implemented in Python (v2)
• parses PDF documents, both text & images
• can convert PDFs to other formats (e.g. HTML)
• can output text + layout information
But still requires significant work to extract data
• requires NLP/ML, but aided by layout information
PDF Documents (Cont’d)
Tables are often sources of useful data, but require … • finding table boundaries
• finding rows and columns
• finding cell boundaries
• extracting text from cells • etc. etc.
Above is easy in HTML Much harder in PDF
But e.g. pdftables.com
12
13
HTML Documents
HTML documents include explicit markup
• which is more semantic than PDF layout data
• making it easier to recognise document components
• but content is still semi-structured (“at creator’s whim”) However, much HTML these days is generated
• giving a regular structure which is parseable
14
HTML Documents (Cont’d)
Example HTML (2)
Sales
- Anti-virus solution 50000
- Backup solutions 150000
Easy to find structure; recognising columns harder.
15
HTML Documents (Cont’d)
Example HTML (2)
Sales
Solution | Quantity |
---|---|
Anti-virus | 50000 |
Backup | 150000 |
Easy to find columns; headings assist with semantics.
16
HTML Documents (Cont’d)
The Python BeautifulSoup library allows analysis of HTML Example:
Some simple content.
Assume above is in http://sec.com/p.html
17
HTML Documents (Cont’d)
import requests
from bs4 import BeautifulSoup
URL = ‘https://www.monster.com/jobs/search/?q=Data-Scientist&where=Australia’ page = requests.get(URL, verify = False)
soup = BeautifulSoup(page.content, ‘html.parser’)
job_elems = results.find_all(‘section’, class_=’card-content’)
for job_elem in job_elems:
title_elem = job_elem.find(‘h2′, class_=’title’) company_elem = job_elem.find(‘div’, class_=’company’) location_elem = job_elem.find(‘div’, class_=’location’) print(title_elem)
print(company_elem)
print(location_elem)
Can use text and structure to explore document
18
HTML Documents (Cont’d)
Combined with other Python modules e.g. regexps
• BeautifulSoup provides powerful tools to extract text • can also place text in context within page structure
• can allow extraction of structured data
– from within known page structures
– or using specific patterns of tags/text
– which is often the case nowadays
– when HTML is mostly generated by scripts
19
HTML Documents (Cont’d)
Example: UNSW Handbook
Easy to navigate and scrape because …
• all program/stream/course pages have same structure • significant amount of cross-linkage among pages
• index pages give links to all pages of given type https://www.handbook.unsw.edu.au/
HTML Documents (Cont’d)
Example: Bojangles (retires and replaced with Crossangles)
Timetable generator
implemented by
CSE student
using data scraped from classutil
20
Source: https://tdransfield.net/projects/bojangles/
21
XML and JSON Data
XML and JSON are already structured data
• good for representing hierarchical structure
• have tags to indicate type of data (metadata) • have much software to traverse their content Tags help to massage into target structure
XML and JSON Data (Cont’d)
Example XML
23
XML and JSON Data (Cont’d)
Example XML: xml.etree.ElementTree import xml.etree.ElementTree as ET tree = ET.parse(‘country_data.xml’) root = tree.getroot()
root.tag # displays ‘countries’ for c in root:
print c.tag c.attrib # displays …
# country {‘name’: ‘Liechtenstein’} # country {‘name’: ‘Singapore’}
# country {’name’: ‘Australia’}
24
XML and JSON Data (Cont’d)
Much useful data is available in JSON format
• typically from web service API’s
Python provides JSON library module (json)
• dump() serializes Python objects as JSON data
• load() converts JSON data into Python objects
• apply standard Python methods to load’ed objects
Example: https://hackerone.com/reports/328486.json
25
CSV Data
Most CSV data is effectively a (relational) table
• however may not be normalised (in RDB sense) Much CSV data is produced from spreadsheets
• column headings provide metadata (if available) Example:
“ZID”,“Name”,”Degree”,”WAM” “8787878”,”Al-Banna, Morty”,”1678”,”80.5” “9601234”,“Paik, Helen”,”8543”,”95.1”
26
CSV Data (Cont’d)
Python has several CSV modules
csv provides basic reading/writing of CSV data
• each row becomes a Python list
• collection of rows is a list of lists
pandas provides reading/writing CSV data
• also, an abstraction of the data (DataFrame)
• plus a range of operators for filtering/calculating
Example: http://www.abs.gov.au/browse?opendocument&ref=topBar
Back to Data Services
27
28
Challenge
How do we store and access this data ?
E
• Data operations are mainly transactions (Reads and “Writes”)
• Operations are mostly on-line
• Response time should be quick but important to maintain security and reliability of
the transactions.
• ACID properties are important
–
Commerce website
28
29
Challenge
How do we store and access this data ?
Image serving website (many social network sites in general)
•
• • •
Data operations are mainly fetching information (Reads)
• alsothe“fan-out”effectischallenge Operations are mainly on-line
High bandwidth requirement
ACID requirements can be relaxed
29
30
30
Designing Data-Intensive Applications, by Martin Kleppmann
Challenge
Fan-out …
A user can see tweets posted by the people they follow …
• A new post –> look up the followers and ‘write’ to each follower’s timeline ahead of time -> makes reading easy
• But this also creates a lot of ‘writing’ work
• On average 75 followers, but can vary widely (some users have 30 million followers)
• May need to consider the distribution of the followers per user (and how often each user tweets)
31
Challenge
• How do we store and access this data ?
Search Website
• Data operations are mainly reading index files for answering queries (Reads)
• Index compilation is performed off-line due to the large size of source data (the entire Web)
• ACID requirements can be relaxed
• Response times must be as fast as possible.
31
32
32
Designing Data-Intensive Applications, by Martin Kleppmann
Challenge for API …
How do we store and access this data over the web ?
• Consumption of Data (for you to take data in …)
• Publication of Data (for you to make data available for others …)
Important question: What is your data model behind the API?
Data models can change how we think about the problem you are solving
33
What is in a data model …
An application developer “thinks” in terms of the real world (people, organisations, actions, goods, etc.) … and model it as objects/data structures and APIs that manipulate them – these models are very specific to each application
When you want to store the objects, you express them in generic-purpose data model such as JSON, XML documents or tables.
The “storage” also allows the representation to be queried, searched or manipulated.
The engineers of the ‘storage solution’ software decide on how JSON/XML/tables are represented in terms of bytes in memory, disk or on a network.
Translations between application model to the generic purpose data model
to storage
33
Designing Data-Intensive Applications, by Martin Kleppmann
34
34
Relational Model vs. “NoSQL” Models
Relational Model (more or so synonymous with SQL)
• The best known, probably the most successful data model which has proven itself in many aspects to be the data model of choice in many applications
• Data is organised into relations (table) where each relation holds an unordered collection of tuples (rows)
Based on solid theory and well engineered implementation -> many competing models have been proposed, but never managed to take over SQL
Built for business data processing
• Typical business transactions (airline reservations, stock keeping, etc.)
• Batch processing (invoicing, payroll, reporting, etc.)
Turned out it was still generically applicable to many modern Web applications too
Designing Data-Intensive Applications, by Martin Kleppmann Image: http://gnosis.cx/publish/programming/xml_matters_8.html
35
Relational Model vs. “NoSQL” Models
The rise of NoSQL … (since 2010 or so)
• Refers to a host of technologies that implement distributed, “non-relational” databases
Why NoSQL?
• A need for greater scalability – very large datasets or very high ‘write’ throughput
• A need for more expressive and dynamic data model
• Usually do not require a fixed table schema nor do they use the concept of joins
• All NoSQL offerings relax one or more of the ACID properties
35
Designing Data-Intensive Applications, by Martin Kleppmann Image: https://www.slideshare.net/Dataversity/trends-in-data-modeling
36
Problems with Relational Models
36
Normalisation … 3NF
* many fragments -> leading to many joins -> scalability ?
Designing Data-Intensive Applications, by Martin Kleppmann Image: https://www.slideshare.net/Dataversity/trends-in-data-modeling
37
Problems with Relational Models
The Object-Relational Mismatch (Impedance Mismatch)
• •
Refers to the problem of a mismatch between application data model (your business objects) and data model for storage (in relational tables)
This mismatch creates a need for an awkward translation layer between the objects in the application code and the database model of tables/row/columns.
37
Designing Data-Intensive Applications, by Martin Kleppmann http://enterprisecraftsmanship.com/2016/11/03/oop-fp-and-object-relational-impedance-mismatch/
38
Alternative Data Models?
Relational Modelling
of a resume (e.g., LinkedIn Profile)
38
Designing Data-Intensive Applications, by Martin Kleppmann (Chapter 2)
Typical normalised form would put multi-values in separate tables with user_id as foreign key
Fragmented tables -> join
39
Added features in SQL …
Some databases support an idea similar to ‘Arrays’:
• an store multi values in a single row
• can be queried and indexed
User_id
…
Job_title
School_name
251
{co-chair, Bill & Melinda Gates ..}, {Chairman, Microsoft)
{Havard University, 1973,1975},{Lakes ide School, Null, Null}
39
Designing Data-Intensive Applications, by Martin Kleppmann (Chapter 2)
40
Alternative Data Models?
40
Designing Data-Intensive Applications, by Martin Kleppmann
Another option:
• Encodes jobs, education, contact info as a JSON (or XML) document
• Stores the whole document in a text column in the database
• Application code accessing this info will have to deal with the structure as a whole
• You cannot use the database to query for values inside the column
Document-based databases support this idea naturally (e.g., MongoDB – insert/query JSON objects)
41
Document-based databases
MongoDB (the most well-known example)
Notable points:
• Collections do not enforce a schema. Documents within a collection can have different fields. Typically, all documents in a collection are of similar or related purpose
• No joins (everything embedded in a single object)
41
https://www.tutorialspoint.com/mongodb/mongodb_overview.htm
42
Document-based databases
42
Designing Data-Intensive Applications, by Martin Kleppmann
Embedded objects normally are the result of One-to-Many relationships
Improved “locality”
• a single retrieval request is enough to get all necessary info on “User”
The mismatch between application data model and storage-purpose data model is significantly reduced
• “Create a User” (JSON) in app code and “Insert a User” (JSON) into Document Collections
43
Document model is not good with …
43
What about Many-to-One or Many-to-Many?
The relational model based solution of these “look-up tables” are useful:
• Consistent style and spelling across Users
• Avoiding ambiguity (e.g., if several cities with the same name)
• Ease of updating (the name is stored in only one place)
• Better search – a search for philanthropists in the state of Washington can match this User 251 (via another table)
Storing ID vs Text -> NOT duplicating text is more flexible and keeps data consistent – reason for normalising in RDB Designing Data-Intensive Applications, by Martin Kleppmann
44
Document model is not good with …
The single “documents” tend to become more interconnected as more features are added
44
Designing Data-Intensive Applications, by Martin Kleppmann
The company – linking it as a full entity by itself (Many-to-One Relationship)
The recommendations – linking it to other Users
(Many-to-Many Relationships)
45
45
Designing Data-Intensive Applications, by Martin Kleppmann
Relational vs. Document
When it comes to representing many-to-one and many-to-many relationships, both are not that different …
• Foreign keys (ID references) in relational
• Document references (Doc ID) in document-based
The IDs are resolved at retrieval time by using a join or follow-up queries.
• •
But joins on M-M or M-1 relationship are a routine – highly optimised at the database level
Document models – join support could be weak, application code might have to resolve the relationships as needed
46
46
Designing Data-Intensive Applications, by Martin Kleppmann
Relational vs. Document
Which data model leads to simpler application code?
• If the application data model looks like a tree (document-like) -> it can be loaded at once using document-based model
• If M-M relationships are central to the application data model -> relational model is efficient in joins. If document model is used, some of the ‘join’ logic will have to move to application code
Consider the kinds of relationships between data items. If they are highly interconnected data (e.g., social network)
• document model is not so good,
• relational model is OK …
• graph models would be natural (to be seen later)
47
47
Designing Data-Intensive Applications, by Martin Kleppmann
Relational vs. Document
Schema flexibility, always a good thing?
• Most document-based databases do not enforce any schema in documents (schema-less databases)
• Arbitrary keys and values can be added to a document and when reading clients have no guarantees as to what fields the documents may contain
• Schema-on-read
– The structure of the data is implicit, only interpreted when the data is read by application code – ≈ dynamic (runtime) type checking
• Schema-on-write
– The traditional approach of RDB – explicit schema and the database ensures all written data conforms to it – ≈ static (compile-time) type checking
48
48
Designing Data-Intensive Applications, by Martin Kleppmann
Relational vs. Document
Schema flexibility, always a good thing?
• When does this ‘schema-on-read/write’ matter? -> when application wants to change the format of its data.
• E.g., User name in one field -> User name in two fields.
(Relational DB)
DOC model is considered advantageous if the docs in the collection tend to have different structures (e.g., different types of related objects)
(Document Based)
49
Relational vs. Document
Data locality for queries – doc-based systems store a document as a single continuous string as JSON or XML (or a binary variant)
If your application requires the entire document (e.g., to render it on a Web page as a whole), there is a performance advantage over split tables
49
Designing Data-Intensive Applications, by Martin Kleppmann
50
Relational vs. Document
The locality of Doc-based systems
Data locality advantage only applies if you need large parts of the document at a time (often the whole document needs to be loaded only if you need to access a small portion of it)
On Updates, normally the whole document needs to be rewritten (except tiny changes that do not change the overall encoded size of the document)
50
Designing Data-Intensive Applications, by Martin Kleppmann
51
Relational vs. Document
51
Convergence of document and relational databases
PostgreSQL (since v.9.3), MySQL (since v.5.7). IBM DB2 (since v.10) support JSON documents.
RethinkDB, MongoDB (document-based) support relational-like joins in its query language
The two models can complement each other -> A hybrid model seems like a trend in these two systems
Designing Data-Intensive Applications, by Martin Kleppmann https://www.zdnet.com/article/the-emergence-of-nosql-and-convergence-with-relational-databases/
Graph-like Models
M-M relationships are an important factor in deciding which data model to go with
1-M (tree/doc), self-contained -> Document model
M-M -> either relational or graph
Highly M-M, complicated connections -> graph …
Graph:
• •
Vertices/nodes: represent entities
Edges/arcs: represent relationships
The recommendations – linking it to other Users
(Many-to-Many Relationships)
User User
52
52
Designing Data-Intensive Applications, by Martin Kleppmann
Graph-like Models
Many kinds of data can be modelled as a graph
• Social Graph – vertices are people, edges indicate which people know each other
• The Web Graph – vertices are web pages and edges indicate HTML links to other pages
• Road or Rail networks – vertices are junctions and edges represent the roads/railways between them
Well-known algorithms on the model
http://www.supplychain247.com/article/why_supply_chains_should_be_more_socially_engaged http://canacopegdl.com/single.php?id=http://www-inst.eecs.berkeley.edu/~cs61bl/r//cur/graphs/web.graph.png
53
53
https://visualign.wordpress.com/2012/07/11/london-tube-map-and-graph-visualizations/ Designing Data-Intensive Applications, by Martin Kleppmann
Graph-like Models
54
54
Designing Data-Intensive Applications, by Martin Kleppmann
Vertices are not limited to the same type of data.
Graph-like Models
Facebook, TAO system (2013)
55
55
Designing Data-Intensive Applications, by Martin Kleppmann
(https://www.usenix.org/system/files/conference/atc13/atc13-bronson.pdf)
Storing and Querying Graph-like Models
Property Graph model:
Each vertex:
• Identifier
• A set of outgoing edges
• A set of incoming edges
• A collection of properties (key- value pairs)
Each edge:
(e.g., PostgreSQL, using json type)
56 56
Designing Data-Intensive Applications, by Martin Kleppmann
• •
•
• •
Identifier
The vertex at which the edge starts (tail)
The vertex at which the edge ends (head)
A label for the relationship A collection of properties
Storing and Querying Graph-like Models
Property Graph model:
Any vertex can have edges (no schema-based restriction on what kinds of ‘things’ can be connected)
Given any vertex, you can efficiently find both incoming and outcoming edges – traversing the graph
By using different labels for different types of relationships, you can store several different kinds of information in a single graph
These features give graphs a great flexibility for data modelling
(e.g., PostgreSQL, using json type)
57
57 Designing Data-Intensive Applications, by Martin Kleppmann
So graphs are “very” flexible … (cf. RDB)
58
58
Designing Data-Intensive Applications, by Martin Kleppmann
• Differentkindsofregionalstructuresindifferentcountries
• Type country “within” a type country
• Varyinggranularity(e.g.,born_in“type:state”,lives_intype:city)
Accessing DB from an application …
When you work with a database system (regardless of its storage model) in an application, the code issues a query statements to the database via some form of “data-connectivity API”
The application code blocks relating to using this library form “Data Access Layer” in the stack.
• For objects to persists, we need to convert the object values into the values that can be stored in the storage and convert them back upon retrieval.
• This should be done while preserving the properties of the objects and their relationships
59
59
Impedance (or Paradigm) Mismatch Problem
60
60
Impedance Mismatch Problem
Granularity Problem
How should this be represented in relational tables?
•
• •
Should we add an Address table?
Should we add an Address column to the User table instead? Should the Address be a string? Or multi-columns?
• CoarseGranularity,asasinglefield • FineGranularity,asmultiplefields
address = 200 2nd Ave. South #358, St. Petersburg, FL 33701-4313 USA
61
61
street address = 200 2nd Ave. South #358 city = St. Petersburg
postal code = FL 33701-4313
country = USA
62
62
Impedance Mismatch Problem
In application code:
Objects can be either equal or identical:
identical = same object (address)
equal = same values
Identity Concept Mismatch
In RDB, these two separate concepts do not exists. There is only one concept of identify = primary key. (i.e., same primary key -> same objects)
Potentially problematic, if duplicate objects are considered the same object (or vice versa) in database
Impedance (or Paradigm) Mismatch Problem
Association Problem
63
63
Impedance Mismatch Problem
In OO, method chaining like:
Object Graph Navigation
User.getBillingDetails().getAccountNumber() is commonly done …
From a user, you access the billing information, from that, you access the account number …
However, this is not an efficient way to retrieve data from relational tables (i.e., instead of accessing single objects, you’d do joins …)
64
64
65
Query languages for data
Most query languages are Declarative:
• Specify the pattern of data to be returned, not how it is returned
• The database is optimised on how to do this
This declarative query paradigm is the same in Relational or Document-based systems
Most programming languages are imperative:
• step-by-step instructions on how the data should be returned …
65
Designing Data-Intensive Applications, by Martin Kleppmann
66
Query languages for data
Most query languages are imperative:
• Specify the pattern of data to be returned, not how it is returned
• The database is optimised on how to do this
Declarative query paradigm is good for many reasons:
It is up to the database to decide which index tables to consult, which joins should be performed, in which order the various parts of the query are executed
The query language can remain concise
Any further optimisation/performance improvement of the database system can happen without affecting the query interface (e.g., re-arranging disk space)
Suitable for parallel execution strategy …
66
Designing Data-Intensive Applications, by Martin Kleppmann
67
Query Languages for Data
Accessing DB from an Application:
When you work with a database system (regardless of its storage model) in an application, the code issues a query statements to the database via some form of “data-connectivity API”
Database connectivity API specifications
• Java has JDBC API, Python has DB-API, Microsoft variety has ODBC API, etc.
Each specification is then implemented by the database system provider as a library for the developers (e.g., DB-API library for PostgreSQL, or JDBC library for Oracle)
The application code blocks relating to using this library form “Data Access Layer” in the stack.
67
Designing Data-Intensive Applications, by Martin Kleppmann
68
Directly Executing SQL
DBAPI – e.g., psycopg2
Connect
(network/or file handle)
“An object for
Table rows and loops within them”
Bound parameter
68
69
Directly Executing SQL
DBAPI
• Many different implementations of the spec …
• Inconsistency between different implementations (e.g., bound parameter formats, exception hierarchy)
• No explicit transaction markers (no begin() transaction)
69
A range of services built on top of DBAPI
70
Directly Executing SQL
SQLAlchemy
• “Uniform” SQL access to relational databases (in SQLAlchemy way)
• i.e., SQL access library built on top of the DBAPI connectivity
from sqlalchemy import create_engine
engine = create_engine(‘postgresql://usr:pass@localhost:5432/sqlalchemy’) …
engine = create_engine(‘sqlite:///some.db’)
Object that maintains
The classical DBAPI interactions With the database
70
71
Directly Executing SQL
•
Connection Pooling (≈ connection sharing) • CreatingDBconnectionsareexpensive
• With pooling, program fetches an existing connection, use and put it back into pool
• easier management of the number of connections that an application might use simultaneously
SELECT TOP 10 * FROM people;
Or
SELECT * FROM people LIMIT 10;
71
Directly Executing SQL (SQLAlchemy Core)
SQL Expression Language
72
72
http://docs.sqlalchemy.org/en/latest/core/tutorial.html
73
MongoDB Query
The basic idea of databased connectivity API applies with MongoDB too … Many implementations
• Direct: PyMongo, Motor
• ORM-like: PyMODM, MongoEngine, etc.
73
https://www.mongodb.com/blog/post/getting-started-with-python-and-mongodb
JSON Documents
as the first class citizens
74
MongoDB Query
MongoDB Server (download, install, run …) and MongoDB Client (connect, create db, …)
Create one if the collection doesn’t exist
74
https://docs.mongodb.com/manual/crud/
Querying Graph-like Models
Cypher Query – declarative query language for graphs https://neo4j.com/developer/example-project/
75
75
Designing Data-Intensive Applications, by Martin Kleppmann
76
76
Designing Data-Intensive Applications, by Martin Kleppmann
“Find the names of all the
people who emigrated from the United States to Europe”
“Declarative” language ->
the execution details hidden
Doing the same in relational
tables and SQL … ??
77
77
Designing Data-Intensive Applications, by Martin Kleppmann
78
Object Relational Mapping (ORM)
e.g., Python SQLAlchemy
(Optional)
SQL access management
libraries”
Python standard/core Low-level database communication layer (many implementations)
• psycopg2 (PostgreSQL)
• cx_oracle (Oracle)
• pyodbc (ODBC)
79
79
80
80
http://techspot.zzzeek.org/category/sqlalchemy/
ORM
ORM is the process of associating object oriented classes (your application domain model) with database tables
Application
Domain Objects
Object.save()
Database
Table Rows
Class.load()
ORM
Some ORM can represent arbitrary rows as domain objects – e.g., rows derived from SELECT statement joining multiple tables
Application
Domain Objects
Database
SELECT statement Table 1 Table 2
81
81
http://techspot.zzzeek.org/category/sqlalchemy/
Object.save()
Class.load()
SQLAlchemy ORM
ORM builds on SQLAlchemy Core
In contrast to the SQL Expression Language which presents schema-centric view of the data, ORM provides domain-model centric view of the data
82
82
http://techspot.zzzeek.org/category/sqlalchemy/
After this, User class now has an associated Table called ‘user’
83
83
http://techspot.zzzeek.org/category/sqlalchemy/
Managing and Publishing Metadata
Metadata, what is it?
Managing and Publishing Metadata
Nearly every device we use relies on metadata or generates it …
Edward Snowden – a contractor at United States National Security Agency exposed how the agency collected metadata on telephone calls directly from telecommunications companies (note: only metadata, not the actual conversations)
But how much information could be inferred about individuals from only metadata ?
Possible metadata collected about a phone call:
• Phone numbers (caller, recipient)
• Time and duration of the call
• Mobile phone locations (caller, recipient)
• If mobile phone is in connection with local cell towers, a record of your location at any given moment …
Metadata is becoming as important as the data itself. Naturally, data services APIs should be aware of metadata and know how to publish and consume metadata along with the data.
84
84
Metadata (The MIT Press Essential Knowledge Series)
Managing and Publishing Metadata
Librarians have been working with metadata for centuries …
In the end, they are data – which can be modelled, stored and managed now …
85
85
Metadata (The MIT Press Essential Knowledge Series)
86
86
Metadata (The MIT Press Essential Knowledge Series)
Managing and Publishing Metadata
Why do we need metadata when we have data object itself?
•
Metadata is a “map”, is a means by which the complexity of an object is represented in a simpler form
• A roomful of books is not called a library, books + catalog is. The catalog provides a simplified representation of the materials in the library collection.
• Primarily, metadata helps with ‘resource discovery’ – the process by which information resources that might be relevant to your need is identified.
Descriptive metadata: description of an object
Administrative metadata: information about the origin and maintenance of an object
e.g., a photograph digitized using a specific type of scanner at a particular resolution, with some restrictions on copyright, etc.
Structural metadata: information about how an object is organised (e.g., ToC) Provenance metadata: traces/processes involved in producing the object.
Describing Description …
Metadata is a statement about a potentially informative “thing” (resource).
A well adopted metadata description language is RDF (resource description framework)
Subject: Mona Lisa Predicate: Creator Object: Da Vinci
Subject refers to the ‘entity’ being described
Object refers to another entity being used to describe the subject …
RDF Triples – could be a useful data model just by itself … (graphs -> network analysis -> gets interesting …!!)
87
87
https://www.w3.org/TR/rdf11-primer/
88
88
Pomerantz, Jeffrey. Metadata (The MIT Press Essential Knowledge series)
Descriptive Metadata
Standard, the simplest form of metadata: Dublin Core
Originally developed to help improve the search engine and indexing the web documents
Title: Mona Lisa
Title: La Gioconda Creator: Leonardo da Vinci Subject: Lisa Gherardini Date: 1503–1506
RDF Example (with Dublin Core)
https://www.w3schools.com/xml/xml_rdf.asp
89
89
Metadata (The MIT Press Essential Knowledge Series)
90
Questions?