A bit of history
• 1970/1980: Relational databases
• Storage is expensive
• Data are normalized
• Data are stored regardless of how they will be used
• RDBMSs become popular
• Client/server model
• SQL becomes a standard for querying databases
• 1990: WWW and Internet
• 2000: Web 2.0
• Storage is less expensive
• e-Commerce, social media → data grow → need to scale with data!
A necessary introduction
• NoSQL does not mean “Not SQL”; it more likely stands for “not relational”
• NoSQL is now interpreted as “Not Only SQL”
• It permits the use of SQL-like queries
• NoSQL is not a single product but a collection of diverse, and sometimes related, concepts about data storage and manipulation
NoSQL: definition
• There is not a generally accepted definition in the literature
• Characteristics of NoSQL
• schemaless
• not using only SQL
• generally open-source (even though the NoSQL notion is also applied to closed-source systems)
• generally driven by the need to run on clusters (but graph databases do not typically fall in this class)
• generally not handling consistency through ACID transactions (but graph databases typically do)
NoSQL models
• There exist different kinds of NoSQL systems, and each family presents different variations
• The most important families of NoSQL databases are:
• Key-value
• Document-based
• Column-oriented
• Graph-based
• Aggregate-oriented Vs. graph-based
Aggregate orientation
(partly taken from NoSQL Distilled: A Brief Guide to the Emerging World of Polyglot Persistence, P. J. Sadalage, M. Fowler, Addison-Wesley)
Impedance mismatch (1)
• Difference between the relational model and the in-memory data structures
• In-memory structures are more flexible (e.g., they can be nested)
• To use more flexible in-memory structures, it is necessary to translate them into a relational representation
• Impedance mismatch became more relevant with the development of object-oriented programming languages
• Introduction of object-oriented databases → failure
• Impedance mismatch is easier to deal with thanks to object-relational mapping frameworks
• Not a real solution!
Impedance mismatch (2)
• Example of a single aggregate structure mapped onto many relational tables
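A minimal Python sketch of this mapping, assuming a hypothetical order aggregate (the field names, values, and three-table layout are illustrative and not taken from the slides). One nested in-memory structure has to be split into flat rows for several tables:

```python
# Hypothetical in-memory aggregate: one order with nested line items and a payment.
order = {
    "id": 99,
    "customer_id": 1,
    "line_items": [
        {"product_id": 27, "price": 32.45},
        {"product_id": 31, "price": 10.00},
    ],
    "payment": {"card": "1000-1000", "amount": 42.45},
}


def to_relational_rows(order):
    """Split the nested aggregate into flat rows, one list per (assumed) table."""
    orders = [(order["id"], order["customer_id"])]
    line_items = [
        (order["id"], item["product_id"], item["price"])
        for item in order["line_items"]
    ]
    payments = [(order["id"], order["payment"]["card"], order["payment"]["amount"])]
    return {"orders": orders, "line_items": line_items, "payments": payments}


rows = to_relational_rows(order)
```

Reading the order back requires the reverse translation (joins on the order id), which is the work that object-relational mapping frameworks hide but do not remove.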
Scalability issues
• Due to the considerable increase in the amount of data seen in the 2000s, scalability is paramount
• Vertical scalability: more powerful machines → expensive
• Horizontal scalability: clusters → less costly and more reliable
Scalability issue: clusters
• Clusters are more suitable to the emerging scenarios (e.g., data generated by social networks)
• RDBMSs have not been designed to operate on clusters
• Designed as single-server
• Need to think of an alternative to RDBMS for data management
Aggregate orientation
• Intuition: operate on data in units with a more complex structure than a tuple
• Aggregate: a collection of related objects to be treated as a unit
• goal 1: update aggregates atomically
• goal 2: communicate with storage in terms of aggregates
• Aggregates fit distributed scenarios
• natural unit for sharding and replication (more on this later)
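To illustrate why an aggregate is a natural unit for sharding, here is a minimal sketch that assigns each aggregate to a node from a hash of its identifier. The node names and the modulo-based placement are assumptions for illustration; real systems often use more elaborate schemes (for example consistent hashing).

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical cluster members


def shard_for(aggregate_id: str, nodes=NODES) -> str:
    """Pick a node from a stable hash of the aggregate's identifier."""
    digest = hashlib.sha256(aggregate_id.encode("utf-8")).hexdigest()
    return nodes[int(digest, 16) % len(nodes)]


# The whole aggregate (e.g. an order with its items and payments) lives on one node,
# so reading or updating it never has to touch the other nodes.
node = shard_for("order:99")
```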
Aggregate orientation (example)
• Application: e-commerce, need to store information on customers, products, orders, shipping addresses, billing addresses, payment data
• Relational modeling
• No data is replicated (normalization)
• Referential integrity (foreign key constraints)
• DBMS cannot use knowledge of the aggregates for storage
Aggregate orientation (example)
• Application: e-commerce, need to store information on customers, products, orders, shipping addresses, billing addresses, payment data
• JSON format (excerpt)
• Two aggregates: customer and order
• Customer contains addresses
• Order contains payments, which in turn contain addresses
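The JSON excerpt from the slide is not reproduced in this text. As a stand-in, the following sketch shows what the two aggregates could look like; every field name and value is hypothetical and chosen only to mirror the structure described above (customer with addresses, order with payments that carry their own billing address).

```python
import json

# Hypothetical customer aggregate: the addresses are nested inside the customer.
customer = {
    "id": 1,
    "name": "Example Customer",
    "billingAddress": [{"city": "Chicago"}],
}

# Hypothetical order aggregate: refers to the customer by id, embeds everything else.
order = {
    "id": 99,
    "customerId": 1,
    "orderItems": [{"productId": 27, "price": 32.45}],
    "shippingAddress": [{"city": "Chicago"}],
    "orderPayment": [
        {"ccinfo": "1000-1000-1000-1000", "billingAddress": {"city": "Chicago"}}
    ],
}

order_json = json.dumps(order, indent=2)  # what would be sent to the store as one unit
```

Note that the address is duplicated across aggregates instead of being normalized into its own table: the order can be read and written as one unit, at the price of redundancy.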
To aggregate or not to aggregate?
• Data are organized depending on how they will be accessed
• Aggregation is not a property of the data, but of how data will be used by applications
• focus on the unit of interaction with the storage
• Not always a good idea:
• A given aggregate structure can be an obstacle with a given application (-)
• Fits well with operations on clusters (+)
CAP theorem (1)
• Aggregate-oriented databases and ACID properties are not a good match
• CAP theorem (E. Brewer, 2000)
“Of three properties of shared-data systems (Consistency, Availability and tolerance to network Partitions) only two can be achieved at any given moment in time.”
CAP theorem (2)
Consistency
Availability
Partition Tolerance
CAP theorem (3)
• Consistency
• Every request receives the correct response
• Once data has been written, all future read requests operate on this version of the data
• Availability
• The data are available and responsive
• Each request eventually receives a response
• If you can access a node in the cluster, it can read and write data
• Tolerance to network partitions
• The cluster can survive a partitioning of the network that breaks it into multiple partitions that cannot communicate with each other
CAP theorem: example (1)
• Two nodes with a replica of the data, having value V0 initially
• N1 runs algorithm A writing a new value V1
• N2 runs algorithm B reading the data value
CAP theorem: example (2)
• Ideal immediate propagation
CAP theorem: example (3)
• In real-world scenarios, propagation is not immediate!
CAP theorem: example (4)
If we want a system
• highly available
• composed of a large number of nodes
• where each node is resilient to network problems
then we have to accept that sometimes
• N1 might see V1
• N2 might see V2
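The two-node example can be sketched as a small simulation. This is a toy model under stated assumptions: the `Replica` class, the `pending` queue, and the explicit `propagate()` step are invented for illustration, and stand in for a real replication protocol with network delay.

```python
class Replica:
    """A node holding one copy of the data value."""

    def __init__(self, value):
        self.value = value


n1 = Replica("V0")
n2 = Replica("V0")
pending = []  # updates written on N1 but not yet delivered to N2


def write_on_n1(new_value):
    """Algorithm A: write locally, queue the update for later propagation."""
    n1.value = new_value
    pending.append(new_value)


def propagate():
    """Deliver queued updates to N2 (in reality this takes time or may be blocked)."""
    while pending:
        n2.value = pending.pop(0)


write_on_n1("V1")
read_before = n2.value  # algorithm B reads before propagation: stale value
propagate()
read_after = n2.value   # after propagation both replicas agree
```

Between the write and the propagation, both nodes stay available but return different values; refusing reads on N2 until the update arrives would restore consistency at the cost of availability.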
CAP theorem: how to deal with it? (1)
Give up tolerance to network partitions
• Single server CA system
• Resistance to network partitioning not requested
• A single machine cannot partition
• CA cluster
• If a partition ever happens, all the nodes in the cluster go down (no impact on the CAP theorem’s definition of Availability)
• Hard to guarantee
• Operate on a single node
CAP theorem: how to deal with it? (2)
Give up consistency or availability
• Solution adopted by NoSQL systems
• They trade off between consistency and availability
• Not always a Boolean decision, trade a little consistency for a little availability
• How much you can trade depends on the specific application domain
• Book a room via a hotel booking website replicated at two locations
• C: both nodes agree on the serialization of requests
• Network partitioning would compromise A
• Increase A: master-slave approach
• Further increase A: still accept bookings locally, resolve in case of last room
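The last option (accept bookings locally, resolve conflicts afterwards) can be sketched as follows. The `reconcile` function and its earliest-timestamp-wins policy are assumptions for illustration; the slides do not say how the conflict on the last room is resolved, and in practice this is a business decision.

```python
def reconcile(rooms_available, bookings_site_a, bookings_site_b):
    """Merge bookings accepted independently at two sites once they can talk again.

    Each booking is a (timestamp, guest) tuple. Under the assumed policy the
    earliest bookings are confirmed; the remainder are returned as conflicts
    to be handled by the application (e.g. apology, rebooking, upgrade).
    """
    merged = sorted(bookings_site_a + bookings_site_b)
    return merged[:rooms_available], merged[rooms_available:]


# One room left; during a partition each site accepts one booking for it.
confirmed, conflicts = reconcile(1, [(10, "alice")], [(12, "bob")])
```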
CAP theorem: how to deal with it? (3)
• NoSQL is a varied world
• In general, aggregate-oriented databases support atomic manipulation of a single aggregate at a time
• Atomic manipulation of multiple aggregates → managed in the application
• Deciding where atomicity is needed is part of the strategy for dividing data into aggregates
CAP theorem: conclusions
• Better to think about the tradeoff between consistency and latency
• We can always improve consistency by involving more nodes, but each node increases the response time
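A small sketch of this tradeoff, assuming a request is sent to all replicas in parallel and completes once a required number of them have answered. The function name and the latency figures are hypothetical.

```python
def response_time(replica_latencies_ms, acks_required):
    """Time until `acks_required` replicas have answered, assuming parallel requests."""
    if not 1 <= acks_required <= len(replica_latencies_ms):
        raise ValueError("acks_required must be between 1 and the number of replicas")
    # With parallel requests, the k-th fastest replica determines the waiting time.
    return sorted(replica_latencies_ms)[acks_required - 1]


latencies = [5, 20, 80]  # hypothetical per-replica latencies in milliseconds
fast_but_weak = response_time(latencies, 1)    # wait for one replica only
slow_but_strong = response_time(latencies, 3)  # wait for every replica
```

Waiting for more replicas makes it more likely that the answer reflects the latest write, but the request is then only as fast as the slowest replica it waits for.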
Relationships and aggregate-orientation
• Most NoSQL databases store sets of disconnected documents, values, columns
• Difficult to use them for connected data and graphs.
• Possible solution → embed an aggregate’s identifier inside the field belonging to another aggregate
• Similar to foreign keys
• Joins still at the application level, expensive
• The Riak key-value store allows stored values to be augmented with link metadata, but suitable only for simple graph-structures
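A minimal sketch of an application-level join over embedded identifiers. Plain dictionaries stand in for two key-value collections; the names and data are hypothetical, and no real store's API is being modelled.

```python
# Two collections of aggregates; each order embeds the identifier of its customer.
customers = {"c1": {"name": "Bob"}}
orders = {
    "o1": {"customer_id": "c1", "total": 42.0},
    "o2": {"customer_id": "c1", "total": 8.5},
}


def orders_with_customer(orders, customers):
    """Resolve each embedded customer id with a second lookup (the 'join')."""
    return [
        {
            "order_id": order_id,
            "customer_name": customers[order["customer_id"]]["name"],
            "total": order["total"],
        }
        for order_id, order in orders.items()
    ]


joined = orders_with_customer(orders, customers)
```

Against a remote store, each lookup inside the loop would be a separate network round trip, which is why such joins are expensive.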
Relationships and aggregate-orientation
• Direction of links between aggregates
• Bob’s friends: relatively easy
• Who is friends with Bob? Needs to scan the entire database
• Possible solution → adding backward links
• increased write latency
• increased storage cost
• All these solutions are implementing a graph structure atop a nonnative store
• some of the benefits of partial connectedness, but at substantial cost
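The asymmetry between the two questions, and the cost of backward links, can be sketched with a dictionary of forward links. The data and function names are hypothetical.

```python
# Forward links only: each person aggregate lists the people they consider friends.
friends = {
    "bob": ["alice", "carol"],
    "alice": ["bob"],
    "dave": ["bob"],
    "carol": [],
}


def friends_of(person):
    """Bob's friends: a single lookup on Bob's own aggregate."""
    return friends[person]


def who_is_friends_with(person):
    """Without backward links, every aggregate in the store must be scanned."""
    return sorted(p for p, fs in friends.items() if person in fs)


def build_backward_links(friends):
    """Maintain the reverse direction explicitly: extra storage, extra writes."""
    backward = {p: [] for p in friends}
    for p, fs in friends.items():
        for f in fs:
            backward.setdefault(f, []).append(p)
    return backward


backward = build_backward_links(friends)
```

With `backward` in place the reverse question becomes a single lookup, but every new friendship now has to update two aggregates.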
Relationships and LPGs
• The basic idea behind graph databases is to treat connected data as connected data
• Connections in domains correspond to connections in data
• It is possible to add graphs to increase knowledge
• Example: add