Information Management
Giovanni ̀ degli Studi di databases
• Characteristics
• Schema oriented
Copyright By PowCoder代写 加微信 powcoder
• ACID properties
• Keep large amounts of persistent data
• Thanks to transactions, they limit the complexity of the management of concurrency
• Integration among multiple applications made possible through a shared database
• (Mostly) standard relational model
• Relational algebra for accessing data
Semi-structured data
(thanks to P. Samara5)
(some examples taken from Database Systems – Concepts, Languages and Architectures, P. Atzeni, S. Ceri, S. Paraboschi and R. Torlone, McGraw-Hill)
Semi-structured data
• Do not have a schema, or can respect it partly
• XML: eXtensible Markup Language
• Data properties expressed through markup of XML documents content
• A model for the data can be specified (DTD, XML Schema) • There are standardized query languages
• Data separated from metadata
• JSON: JavaScript Object Notation
• Derived from JavaScript
• Key-value pairs
• Data are organized in batches and nested objects
XML syntax
• XML files are composed of nested tags • Each XML document has a root tag
• All elements are enclosed in tags
• Elements
• Can be extended, can include children elements • Have a content (can be empty)
• Attributes
• Are included in elements, cannot have children
• Associate a value to elements without being part of their contents • Have limitations
XML – example (well-formed)
Namespaces and Schemas
• Namespaces avoid ambigui0es for elements/a7ributes names • Collec0ons of names iden0fied by URI
•
• Schemas permit to define a model for documents
• Tags and their proper0es, cardinali0es, domains, default values • A document can be validated w.r.t. a schema
• Schemas can be defined with DTD and XML Schema • DTD simpler, but XML Schema more flexible
• Permits to define elements and their cardinalities, and attributes •
• Can specify whether it contains children and their number
•
• Types and constraints (required, implied, fixed)
DTD – example
format (paperback|hardback) ‘paperback’
XML Schema – complex elements
• Complex elements
• Can contain attributes and other elements • Can have indicators
• any, all, choice, sequence: ordering among elements • {min,max}Occurs: min/max number of occurrences, grouping
•
XML Schema – attributes
• Attributes
• Can have default and fixed values
• Can be optional
• Can have restrictions (e.g., min/max value)
•
XML Schema – example
Querying XML documents
• Idea: XML as a tree structure to be visited
• Queries are based on XPath and FLWOR expression
• For, Let, Where, Order, Return
• Defines path expressions selecAng the objects in a path
• Can return a set of nodes, a Boolean value, numbers, strings • Paths are evaluated w.r.t. a context node
• Current node (.), parent node (..)
• Child of a node (/), descendant of a node (//), any node (*) • ANribute of current node predicate ([]), posiAon ([n])
XPath – examples
doc(“books.xml”)/List/Book
doc(“books.xml”)/List/Book[Publisher= “Feltrinelli” AND @availability=“Y”] /Title
doc(“books.xml”)//Author doc(“books.xml”)/List/Book[1]/*
XQuery – FLWOR expressions
• FOR, LET
• They declare variables, difference in bindings
• It expresses conditions filtering objects produced by LET and FOR
• It orders objects produced by LET and FOR
• It generates the result of the query (node, forest, or value, and can include element constructors and references to variables)
XQuery – example (1)
FOR $lib IN doc(“books.xml”)//Book
FOR $auth IN $lib/Author
RETURN $auth
LET $lib := doc(“books.xml”)//Book
RETURN $lib
FOR $lib IN doc(“books.xml”)//Book
WHERE $lib/Publisher=”Bompiani”
RETURN $lib
XQuery – example (2)
FOR $lib IN doc(“books.xml”)//Book
WHERE $lib/Publisher=”Bompiani”
RETURN
FOR $lib IN doc(“books.xml”)//Book
ORDER BY(Title ASCENDING)
XQuery – example (3)
let $author := doc(“catalog.xml”)//author
for $name in distinct-values($author/text())
order by $name
return
for $book in doc(“catalog.xml”)//libro
where $book/author/text()=$name
} return $book/title
XML Vs. DBMS
• RDBMS with extended capabilities
• Interfaces for translating XML to/from relational
• Interfaces for translating XML queries to relational queries • (Loose) mapping between schemas
• Native XML databases
• They lose RDBMS functionalities but natively manage XML structures (e.g.,
element inclusion, ordering among siblings)
• Appealing when data are poorly structured (e.g., schelamess)
A bit of history
• 1970/1980: Relational databases • Storage is expensive
• Dara are normalized
• Data are stored regardless of how they will be used • RDBMS become popular
• Client/server model
• SQL becomes a standard for querying databases
• 1990: WWW and Internet
• 2000: Web 2.0
• Storage is less expensive
• e-Commerce, social mediaàdata growàneed to scale with data!
A necessary introduction
• NoSQL does not mean Not SQL but it is more likely to stand for “not relational”
• NoSQL is now interpreted as “Not Only SQL” • It permits to use SQL-like queries
• NoSQL is not a single product but a collection of diverse, and sometimes related, concepts about data storage and manipulation
NoSQL: definition
• There is not a generally accepted definition in the literature
• Characteristics of NoSQL
• schemaless
• not using only SQL
• generally open-source (even though the NoSQL notion is also applied to
closed-source systems)
• generally driven by the need to run on clusters (but graph databases do not typically fall in this class)
• generally not handling consistency through ACID transactions (but graph databases instead do it)
NoSQL models
• There exist different kinds of NoSQL systems, and each family presents different variations
• The most important families of NoSQL databases are: • Key-value
• Document-based • Column-oriented • Graph-based
• Aggregate-oriented Vs. graph-based
Aggregate orienta+on
(partly taken from NoSQL Dis*lled: A Brief Guide to the Emerging World of Polyglot Persistence, P. J. Sadalage, M.Fowler, Addison-Wesley)
Impedance mismatch (1)
• Difference between the relational model and the in-memory data structures
• In-memory structures are more flexible (e.g., they can be nested)
• To use more flexible in-memory structures, it is necessary to translate them
in a relational representation
• Impedance mismatch more relevant with the development of object-
oriented programming languages
• Introduction of object oriented databasesàfailure
• Impedance mismatch easier to deal thanks to object-relational mapping frameworks
• Not a real solution!
Impedance mismatch (2)
• Example of single aggregate structure mapped in many relational tables
Scalability issues
• Due to the considerable increase in the amount of data to which we assisted in 2000s, scalability is paramount
• Vertical scalability: more powerful machinesàexpensive
• Horizontal scalability: clustersàless costly and more reliable
Scalability issue: clusters
• Clusters are more suitable to the emerging scenarios (e.g., data generated by social networks)
• RDBMSs have not been designed to operate on clusters • Designed as single-server
• Need to think of an alternaDve to RDBMS for data management
Aggregate orientation
• Intuition: operate on data in units with a more complex structure than a tuple
• Aggregate: a collection of related objects to be treated as a unit • goal 1: update aggregates atomically
• goal 2: communicate with storage in terms of aggregates
• Aggregates fit distributed scenarios
• natural unit for sharding and replication (more on this later)
Aggregate orientation (example)
• Application: e-commerce, need to store information on customers, products, orders, shipping addresses, billing addresses, payment data
• Relational modeling
• No data is replicated (normalization)
• Referential integrity (foreign key constraints)
• DBMS cannot use knowledge of the aggregates for storage
Aggregate orientation (example)
• Application: e-commerce, need to store information on customers, products, orders, shipping addresses, billing addresses, payment data
• JSON format (excerpt)
• Two aggregates: customer and order • Customer contains addresses
• Order contains payments that contains addresses
To aggregate or not to aggregate?
• Data are organized depending on how they will be accessed
• Aggregation is not a property of the data, but of how data will be
used by applications
• focus on the unit of interaction with the storage
• Not always a good idea:
• A given aggregate structure can be an obstacle with a given application (-) • Fits well with operations on clusters (+)
CAP theorem (1)
• Aggregate-oriented databases and ACID properties are not a good match
• CAP theorem (E. Brewer, 2000)
“Of three properties of shared-data systems (Consistency, Availability and tolerance to network Partitions) only two can be achieved at any given moment in time.”
CAP theorem (2)
Consistency
Availability
Par11on Tolerance
CAP theorem (3)
• Consistency
• Every request receives the correct response
• Once data has been written, all future read requests operate on this version of the data
• Availability
• The data are available and responsive
• Each request eventually receives a response
• If you can access a node in the cluster, it can read and write data
• Tolerance to network partitions
• The cluster can survive to partitioning of the network that break the cluster in multiple partitions that cannot communicate with each other
CAP theorem: example (1)
• Two nodes with a replica of the data, having value V0 initially
• N1 runs algorithm A writing a new value V1 • N2 runs algorithm B reading the data value
CAP theorem: example (2)
• Ideal immediate propagation
CAP theorem: example (3)
• In real world scenarios, propaga1on is not immediate!
CAP theorem: example (4)
If we want a system
• highly available
• composed of a large number of nodes
• where each node resists to network problems
then we have to accept that sometimes • N1 might see V1
• N2 might see V2
CAP theorem: how to deal with it? (1)
Renounce to tolerance to network partitions
• Single server CA system
• Resistance to network partitioning not requested • A single machine cannot partition
• CA cluster
• If a partition ever happens, all the nodes in the cluster go down (no impact on
the CAP theorem’s definition of Availability) • Hard to guarantee
• Operate on a single node
CAP theorem: how to deal with it? (2)
Renounce to consistency or availability
• Solution adopted by NoSQL systems
• They trade-off between consistency and availability
• Not always a Boolean decision, trade a little consistency for a little availability • How much you can trade depends on the specific application domain
• Book a room via hotel booking web site replicated at two locations • C:bothnodesagreeontheserializationofrequests
• NetworkpartitioningwouldcompromiseA
• IncreaseA:master-slaveapproach
• FurtherincreaseA:stillacceptbookinglocally,resolveincaseoflastroom
CAP theorem: how to deal with it? (3)
• NoSQL is a varied world
• In general aggregate-oriented databases support atomic manipula:on of a
single aggregate at a :me
• Atomic manipula:on of mul:ple aggregatesàmanaged in the applica:on
• Considera:ons on where atomicity is wanted is part of the strategy to divide data into aggregates
CAP theorem: conclusions
• Better think about the tradeoff consistency and latency
• We can always improve consistency by involving more nodes, but
each node increases the response time
• We can then think of availability as the limit of latency that we can
tolerate: once latency gets too high, we consider the data unavailable
• Every system should be designed to ensure both C and A in normal situations
• When a partition occurs, the decision between C and A can be taken
• When the partition is resolved the system takes corrective actions coming back to work in normal situation
From ACID to BASE properties
BASE: Basically Available, Soft state, Eventual consistency
• Basically available: the system is available most of the time and there
could exist a subsystem temporarily unavailable
• Soft state: data are volatile in the sense that their persistence is in the hand of the user that must take care of refreshing them
• Eventual consistency: the system eventually converges to a consistent state (lazy updates, already discussed for distributed databases)
Key-Value databases (1)
• Used when all accesses to the database are via a primary key, associative arrays
• Think of a relational table with two attributes: • ID: has unique values
• NAME: attribute of data type string
• The client can:
• Get the value of a key
• Set a value for a key
• Delete a key (and its value)
• Simple hash tableàCost of operations O(1)
• Examples: Riak, Redis, B, Amazon DynamiDB, Voldemort (used by LinkedIn), MemcacheDB, …
Key-Value databases (2)
• Using a simple hash table, keys are stored in buckets
• The value associated with a key is an obscure blob, which can hardly be analyzed by the DBMS
• Can be any kind of data structure (e.g., sets, hashes, etc.) • Can be elaborated by applicaDons
• Very widely used for their simplicity and scalability
• CAP: some systems permit to set values for • N: number of nodes for replica storage
• R: number of nodes to be read for a successful read
• W: number of nodes to be wriLen for a successful write (relevant for consistency)
Key-Value features
• Consistency: applies only for operations on a single key • Riak implements eventual consistency
• Transactions: some key-value stores have a loose definition of transaction, with however no guarantee on write operations
• Structure of data: the value associated with a key can have arbitrary structure (XML, JSON, text, …)
• Scaling: key-value systems scale well
• Some products apply sharding (i.e., the key value determines the node in the
cluster where data are stored)
• Query: can operate only on the key
Key-Value: use cases
Suitable for:
• Store session information
• User profiles and user preferences • Shopping cart data
Not suitable for:
• Relationships among data
• Multioperation transactions
• Query by data
• Operations by sets
Document databases (1)
• Document databases store documents in the value part of the key- value store
• Key-value stores where the value is examinable.
• The database stores and retrieves documents
• The documents stored are similar to each other but do not have to be exactly the same
• Examples: MongoDB, CouchDB, Terrastore, OrientDB, RavenDB
Document databases (2)
• There are no empty/null a0ributes: if an a0ribute is not found, it was not set or not relevant to the document
• New a0ributes can be created without the need to define them or to change the exis?ng documents
• Embedding child documents as sub- objects inside documents provides for easy access and be0er performance
{ “firstname”: “Martin”,
“likes”: [ “Biking”,
“Photography” ],
“lastcity”: “Boston”,
“lastVisited”:
“firstname”: “Pramod”,
“citiesvisited”: [ “Chicago”, “London”, “Pune”],
“addresses”: [
{ “state”: “AK”,
“city”: “DILLINGHAM”,
“type”: “R”
{ “state”: “MH”,
“city”: “PUNE”,
“type”: “R” }
“lastcity”: “Chicago”
Document database features (1)
• Transactions: transactions operating on a single document are generally considered atomic
• Availability: document databases try to improve on availability by replicating data using the master-slave setup
Document database features (2)
• Consistency: MongoDB uses replica sets for managing consistency and tuning it
• Wait, for each write operation, that it is propagated to a given number of slaves
• The higher the number of slaves, the higher
consistency guarantees
Document database features (3)
• Query: can query the data inside the document without having to retrieve the whole document by its key
• MongoDB has a query language which is expressed via JSON • CouchDB supports (materialized and dynamic) views
• Scaling: depending on which kind of action is growing
• Increase the number of slave copies to support more reads
• Shard (i.e., horizontally fragment) and properly allocate data to support more writes
• Each shard can be a replica set to improve reads within
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com