INFO20003: Database Systems
Dr Renata Borovica-Gajic
Lecture 21 NoSQL Databases
Week 11
INFO20003 Database Systems © University of Melbourne 1
Learning Objectives
• By the end of this session, you should be able to:
– Define what Big Data is
– Describe why databases go beyond relational DBs
– Understand why we need NoSQL
– List the types of NoSQL databases
– State the CAP theorem
* material in this lecture is drawn from http://martinfowler.com/books/nosql.html, including talk at GOTO conference 2012
and Thoughtworks article at https://www.thoughtworks.com/insights/blog/nosql-databases-overview
Much of business data is tabular
The dominance of the relational model
• Pros of relational databases
– simple, can capture (nearly) any business use case
– can integrate multiple applications via shared data store
– standard interface language SQL
– ad-hoc queries, across and within “data aggregates”
– fast, reliable, concurrent, consistent
• Cons of relational databases
– Object Relational (OR) impedance mismatch
– not good with big data
– not good with clustered/replicated servers
• Adoption of NoSQL driven by “cons” of Relational
• but ‘polyglot persistence’ = Relational will not go away
But some data is not inherently tabular
One business object (in aggregate form) is stored across many relational tables.
This enables analytical queries like:
select productno, sum(qty)
from InvoiceLineItem
group by productno;
But there is a lot of work to disassemble and reassemble the aggregate.
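The trade-off above can be sketched in a few lines of Python using the standard-library sqlite3 module. This is only an illustration: the Invoice table, the customer field and the sample data are assumptions, and only InvoiceLineItem, productno and qty come from the slide.

```python
import sqlite3

# Hypothetical invoice aggregates (sample data and 'customer' field are assumptions).
invoice = {
    "invoiceno": 1001,
    "customer": "Ann",
    "lines": [{"productno": "P1", "qty": 2}, {"productno": "P2", "qty": 1}],
}
other = {
    "invoiceno": 1002,
    "customer": "Bob",
    "lines": [{"productno": "P1", "qty": 5}],
}

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Invoice (invoiceno INTEGER PRIMARY KEY, customer TEXT)")
conn.execute("CREATE TABLE InvoiceLineItem (invoiceno INTEGER, productno TEXT, qty INTEGER)")


def store(inv):
    """Disassemble one aggregate into rows spread over two tables."""
    conn.execute("INSERT INTO Invoice VALUES (?, ?)", (inv["invoiceno"], inv["customer"]))
    conn.executemany(
        "INSERT INTO InvoiceLineItem VALUES (?, ?, ?)",
        [(inv["invoiceno"], line["productno"], line["qty"]) for line in inv["lines"]],
    )


def load(invoiceno):
    """Reassemble the aggregate from its rows."""
    (customer,) = conn.execute(
        "SELECT customer FROM Invoice WHERE invoiceno = ?", (invoiceno,)
    ).fetchone()
    rows = conn.execute(
        "SELECT productno, qty FROM InvoiceLineItem WHERE invoiceno = ? ORDER BY rowid",
        (invoiceno,),
    ).fetchall()
    lines = [{"productno": p, "qty": q} for p, q in rows]
    return {"invoiceno": invoiceno, "customer": customer, "lines": lines}


store(invoice)
store(other)

# The analytical query from the slide is easy once the data is tabular.
totals = dict(
    conn.execute(
        "select productno, sum(qty) from InvoiceLineItem group by productno"
    ).fetchall()
)
```

The analytical query is a single statement, while store and load carry the cost of taking the business object apart and putting it back together.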
Data in Aggregate form: Examples of JSON and XML
JavaScript Object Notation
eXtensible Markup Language
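As a rough illustration of data in aggregate form, the sketch below serialises one hypothetical order (the field names and values are assumptions, not taken from the slide) to both JSON and XML using only the Python standard library.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical order aggregate: the whole business object in one nested structure.
order = {
    "id": 1001,
    "customer": "Ann",
    "lines": [{"product": "P1", "qty": 2}, {"product": "P2", "qty": 1}],
}

# JSON: nested objects and arrays map directly onto the aggregate.
as_json = json.dumps(order)

# XML: the same aggregate as nested elements with attributes.
root = ET.Element("order", id=str(order["id"]))
ET.SubElement(root, "customer").text = order["customer"]
for line in order["lines"]:
    ET.SubElement(root, "line", product=line["product"], qty=str(line["qty"]))
as_xml = ET.tostring(root, encoding="unicode")

print(as_json)
print(as_xml)
```

In both formats the line items travel inside the order, rather than living in a separate table.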
Big Data and its 3Vs
• Data that exist in very large volumes and many different varieties (data types) and that need to be processed at a very high velocity (speed).
– Volume – much larger quantity of data than typical for relational databases
– Variety – lots of different data types and formats
– Velocity – data comes at very fast rate (e.g. mobile sensors, web click stream)
Big Data Characteristics
• Schema on Read, rather than Schema on Write
• Schema on Write – preexisting data model, how traditional databases are designed (relational databases)
• Schema on Read – data model determined later, depends on how you want to use it (XML, JSON)
• Capture and store the data, and worry about how you want to use it later
• Data Lake
– A large integrated repository for internal and external data that does not follow a predefined schema
– Capture everything, dive in anywhere, flexible access
Jeff Hoffer, Ramesh Venkataraman and Heikki Topi, Modern Database Management: Global Edition
Schema on write vs. schema on read
Traditional database design
The big data approach
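The contrast between the two approaches can be sketched as follows. This is a minimal illustration, assuming a hypothetical student table and a couple of raw JSON records; it uses sqlite3 to stand in for a traditional database and plain strings to stand in for a data lake.

```python
import json
import sqlite3

# Schema on write: the structure is fixed before any data arrives,
# and the database rejects records that do not fit it.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE student (name TEXT NOT NULL, born INTEGER NOT NULL)")
conn.execute("INSERT INTO student VALUES (?, ?)", ("Jack", 1992))
try:
    conn.execute("INSERT INTO student (name) VALUES (?)", ("John",))  # no 'born'
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

# Schema on read: raw records are captured as-is; structure is imposed
# only when a particular question is asked of the data.
raw_records = [
    '{"name": "Jack", "born": 1992}',
    '{"name": "John", "color": "blue"}',
]


def birth_years(lines):
    """Interpret the raw records at read time, skipping those without 'born'."""
    docs = (json.loads(line) for line in lines)
    return [doc["born"] for doc in docs if "born" in doc]


print(rejected)
print(birth_years(raw_records))
```

The relational insert fails at write time, whereas the raw store accepts both records and the reader decides how to handle the missing field.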
NoSQL database properties
• Features
– Doesn’t use relational model or SQL language
– Runs well on distributed servers
– Most are open-source
– Built for the modern web
– Schema-less (though there may be an “implicit schema”)
– Supports schema on read
– Not ACID compliant
– ‘Eventually consistent’
• Goals
– to improve programmer productivity (OR mismatch)
– to handle larger data volumes and throughput (big data)
from NoSQL Databases: An Overview by Pramod Sadalage, Thoughtworks (2014)
Types of NoSQL databases
Types of NoSQL: key-value stores
• Key = primary key
• Value = anything (number, array, image, JSON) – the application is in charge of interpreting what it means
• Operations: Put (for storing), Get and Update
• Examples: Riak, Redis, Memcached, BerkeleyDB, HamsterDB, Amazon DynamoDB, Project Voldemort, Couchbase
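The Put/Get/Update interface above can be sketched with an in-memory dictionary. This is a toy model, not the API of any of the listed products; the class name KeyValueStore and the sample key are assumptions.

```python
import json


class KeyValueStore:
    """Toy key-value store: values are opaque bytes the store never inspects."""

    def __init__(self):
        self._data = {}

    def put(self, key, value: bytes):
        """Store a value under a key, overwriting any previous value."""
        self._data[key] = value

    def get(self, key) -> bytes:
        """Return the value for a key; raises KeyError if the key is absent."""
        return self._data[key]

    def update(self, key, value: bytes):
        """Replace the value of an existing key; raises KeyError if absent."""
        if key not in self._data:
            raise KeyError(key)
        self._data[key] = value


store = KeyValueStore()
store.put("student:1", json.dumps({"name": "Jack", "born": 1992}).encode())

# The application, not the store, decides that these bytes are JSON.
student = json.loads(store.get("student:1"))
student["born"] = 1990
store.update("student:1", json.dumps(student).encode())
```

Note that changing one field means fetching, decoding and rewriting the whole value, because the store cannot look inside it.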
Types of NoSQL: document databases
• Similar to a key-value store except that the document is “examinable” by the database, so its content can be queried, and parts of it updated
• Document = JSON file
• Examples: MongoDB, CouchDB, Terrastore, OrientDB, RavenDB
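What “examinable” means can be sketched with a toy in-memory document store. The class DocumentStore and its method names are hypothetical and do not mirror the API of MongoDB or any other listed product; the point is only that the store itself can match on fields and change part of a document.

```python
class DocumentStore:
    """Toy document store: documents are dicts the store can look inside."""

    def __init__(self):
        self._docs = {}
        self._next_id = 1

    def insert(self, doc):
        """Store a copy of the document and return its generated id."""
        doc_id = self._next_id
        self._next_id += 1
        self._docs[doc_id] = dict(doc)
        return doc_id

    def find(self, query):
        """Return copies of all documents whose fields match every query item."""
        return [
            dict(doc)
            for doc in self._docs.values()
            if all(doc.get(field) == value for field, value in query.items())
        ]

    def set_fields(self, query, changes):
        """Update only the given fields of matching documents; return the count."""
        matched = 0
        for doc in self._docs.values():
            if all(doc.get(field) == value for field, value in query.items()):
                doc.update(changes)
                matched += 1
        return matched


students = DocumentStore()
students.insert({"name": "Jack", "born": 1992})
students.insert({"name": "Jill", "born": 1990})

jill = students.find({"name": "Jill"})           # query on document content
changed = students.set_fields({"name": "Jack"}, {"born": 1990})  # partial update
born_1990 = students.find({"born": 1990})
```

Unlike a plain key-value store, neither the query nor the partial update requires the application to fetch and decode whole values first.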
MongoDB Document Structure
• MongoDB documents are composed of field-and-value pairs
MongoDB document store
// start the MongoDB server, then start the mongo shell with "mongo"
show dbs                                            // show a list of all databases
use test                                            // use the database called 'test'
show collections                                    // show all collections in the database 'test'
db.students.insert( {name: "Jack", born: 1992} )    // add a doc to collection
db.students.insert( {name: "Jill", born: 1990} )    // add a doc to collection
db.students.find()                                  // list all docs in students
db.students.find( {name: "Jill"} )                  // list all docs where name field = 'Jill'
db.students.update( {name: "Jack"}, {$set: {born: 1990}} )   // change Jack's year
db.students.remove( {born: 1990} )                  // delete docs where year = 1990
// now insert complex documents from file – note repeating group, no schema
db.students.find().forEach(printjson)               // print all docs in neat JSON format
db.students.find( {born: 1990}, {name: true} )      // print names for all born in 1990
// update data deep in hierarchy
db.students.update( {id: 222222}, {$addToSet: {subjects: {subject: "English", result: "H1"}}} )
db.students.find( {id: 222222}, {_id: false, subjects: true} ).forEach(printjson)
// add a new student – different schema but still works
db.students.insert( {name: "John", color: "blue"} )
Types of NoSQL: column families
• Columns rather than rows are stored together on disk.
• Makes analysis faster, as less data is fetched.
• This is like automatic vertical partitioning.
• Related columns grouped together into ‘families’.
• Examples: Cassandra, BigTable, HBase
https://www.youtube.com/watch?v=8KGVFB3kVHQ
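Why column storage fetches less data for analysis can be sketched by counting the values each layout has to touch. The sample records and the helper names are assumptions, and the count is a simplified stand-in for disk reads, not a model of how Cassandra, BigTable or HBase work internally.

```python
# Row layout: each record is stored (and fetched) as a whole.
rows = [
    {"id": 1, "name": "Jack", "born": 1992},
    {"id": 2, "name": "Jill", "born": 1990},
    {"id": 3, "name": "John", "born": 1985},
]

# Column layout: all values of one column are stored together.
columns = {field: [row[field] for row in rows] for field in rows[0]}


def sum_from_rows(rows, field):
    """Sum one field; every row is fetched in full, so all its values are touched."""
    total, touched = 0, 0
    for row in rows:
        touched += len(row)
        total += row[field]
    return total, touched


def sum_from_columns(columns, field):
    """Sum one field; only that column's values are touched."""
    column = columns[field]
    return sum(column), len(column)


row_result = sum_from_rows(rows, "born")
column_result = sum_from_columns(columns, "born")
print(row_result, column_result)
```

Both layouts give the same total, but the column layout touches one value per record instead of every field of every record.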
Aggregate-oriented databases
• Key-value, document store and column-family are “aggregate-oriented” databases (in Fowler’s terminology): they store a business object in its entirety
• Pros:
– entire aggregate of data is stored together (no need for transactions)
– efficient storage on clusters / distributed databases
• Cons:
– hard to analyse across subfields of aggregates – e.g. sum over products instead of orders
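The pros and cons above can be sketched with orders stored as whole aggregates in a dictionary. The order ids, field names and quantities are assumptions made for the illustration.

```python
from collections import defaultdict

# Each order is stored in its entirety under one key (hypothetical data).
orders = {
    "o1": {"customer": "Ann", "lines": [{"product": "P1", "qty": 2},
                                        {"product": "P2", "qty": 1}]},
    "o2": {"customer": "Bob", "lines": [{"product": "P1", "qty": 5}]},
}


def order_quantity(order_id):
    """Pro: everything about one order comes back from a single lookup."""
    return sum(line["qty"] for line in orders[order_id]["lines"])


def quantity_per_product():
    """Con: summing over products means opening every order aggregate."""
    totals = defaultdict(int)
    for order in orders.values():
        for line in order["lines"]:
            totals[line["product"]] += line["qty"]
    return dict(totals)


print(order_quantity("o1"))
print(quantity_per_product())
```

The per-order question touches one aggregate, while the per-product question has to scan all of them because product is only a subfield inside each order.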
Types of NoSQL: graph databases
• A ‘graph’ is a node-and-arc network
• Social graphs (e.g. friendship graphs) are common examples
• Graphs are difficult to program in relational DB
• A graph DB stores entities and their relationships
• Graph queries deduce knowledge from the graph
• Examples: Neo4J, Infinite Graph, OrientDB, FlockDB, TAO
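A friendship graph and two typical graph queries can be sketched with an adjacency dictionary. The people and friendships are made up, and the functions are plain Python rather than the query language of any listed product.

```python
from collections import deque

# Hypothetical friendship graph: node -> set of directly connected nodes.
friends = {
    "Ann": {"Bob", "Cat"},
    "Bob": {"Ann", "Dan"},
    "Cat": {"Ann", "Dan"},
    "Dan": {"Bob", "Cat", "Eve"},
    "Eve": {"Dan"},
}


def friends_of_friends(person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = friends[person]
    two_hops = set()
    for friend in direct:
        two_hops |= friends[friend]
    return two_hops - direct - {person}


def degrees_of_separation(start, goal):
    """Length of the shortest friendship chain, found by breadth-first search."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, distance = queue.popleft()
        if node == goal:
            return distance
        for neighbour in friends[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, distance + 1))
    return None  # no chain connects the two people


print(friends_of_friends("Ann"))
print(degrees_of_separation("Ann", "Eve"))
```

In a relational database each extra hop would typically be another self-join on a friendship table, which is part of why such queries are awkward there.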
Summary: NoSQL Classifications
• Key-value stores
– A simple pair of a key and an associated collection of values. Key is usually a string. The database has no knowledge of the structure or meaning of the values.
• Document stores
– Like a key-value store, but “document” goes further than “value”. The document is structured, so specific elements can be manipulated separately.
• Column-family stores
– Data is grouped in “column groups/families” for efficiency reasons.
• Graph-oriented databases
– Maintain information regarding the relationships between data items. Nodes with properties.
Distributed data: the CAP theorem
CAP theorem: alternative presentation
• Fowler’s version of CAP theorem: If you have a distributed database, when a partition occurs, you must then choose consistency OR availability.
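The choice described above can be sketched with a toy store of two replicas. The class TwoNodeStore, the mode labels "CP" and "AP" and the node names are assumptions made for the illustration; real systems are far more involved.

```python
class Unavailable(Exception):
    """Raised when the store refuses a request rather than risk inconsistency."""


class TwoNodeStore:
    """Toy distributed store with two replicas, 'a' and 'b'."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.values = {"a": None, "b": None}
        self.partitioned = False  # True when the replicas cannot communicate

    def write(self, node, value):
        if not self.partitioned:
            # Normal operation: the write reaches both replicas.
            for name in self.values:
                self.values[name] = value
        elif self.mode == "CP":
            # Choose consistency: refuse the write, so the system is unavailable.
            raise Unavailable("cannot replicate during a partition")
        else:
            # Choose availability: accept the write locally; replicas now disagree.
            self.values[node] = value

    def read(self, node):
        return self.values[node]


def run(mode):
    """Write once normally, then attempt a second write during a partition."""
    store = TwoNodeStore(mode)
    store.write("a", "v1")
    store.partitioned = True
    try:
        store.write("a", "v2")
        accepted = True
    except Unavailable:
        accepted = False
    return accepted, store.read("a"), store.read("b")


print(run("CP"))
print(run("AP"))
```

With consistency chosen, the second write is refused and both replicas still agree; with availability chosen, the write succeeds but the two replicas return different values.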
ACID vs BASE
ACID (Atomic, Consistent, Isolated, Durable)
vs
BASE (Basically Available, Soft State, Eventual Consistency)
• Basically Available: This constraint states that the system does guarantee the availability of the data; there will be a response to any request. But data may be in an inconsistent or changing state.
• Soft state: The state of the system could change over time – even during times without input there may be changes going on due to ‘eventual consistency’.
• Eventual consistency: The system will eventually become consistent once it stops receiving input. The data will propagate to everywhere it needs to, sooner or later, but the system will continue to receive input and is not checking the consistency of every transaction before it moves onto the next one.
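The three BASE properties can be sketched with replicas that accept writes locally and exchange them later. This is a simplified model: the single shared version counter and the propagate step are assumptions standing in for the clocks and background replication of a real system.

```python
class Replica:
    """One copy of a single data item, tagged with the version it has seen."""

    def __init__(self):
        self.value = None
        self.version = 0

    def apply(self, value, version):
        """Keep the incoming value only if it is newer than the current one."""
        if version > self.version:
            self.value, self.version = value, version


class EventuallyConsistentStore:
    def __init__(self, replica_count=3):
        self.replicas = [Replica() for _ in range(replica_count)]
        self.pending = []   # writes not yet delivered to every replica
        self.clock = 0      # simplification: one global version counter

    def write(self, replica_index, value):
        """Basically available: the write is accepted without waiting for peers."""
        self.clock += 1
        self.replicas[replica_index].apply(value, self.clock)
        self.pending.append((value, self.clock))

    def read(self, replica_index):
        """May return stale data if propagation has not happened yet."""
        return self.replicas[replica_index].value

    def propagate(self):
        """Soft state: replicas change here even though no new input arrived."""
        for value, version in self.pending:
            for replica in self.replicas:
                replica.apply(value, version)
        self.pending.clear()


store = EventuallyConsistentStore()
store.write(0, "x")
before = [store.read(i) for i in range(3)]   # replicas disagree
store.propagate()
after = [store.read(i) for i in range(3)]    # replicas have converged
```

Between the write and the propagation step a reader may see old data; once input stops and propagation completes, every replica returns the same value.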
If you want to know more
More technical details (I won’t ask you these things):
https://www.youtube.com/watch?v=YUWUH_7aWHs&index=11&list=PLdQddgMBv5zHcEN9RrhADq3CBColhY2hl
What is examinable?
• What is big data / NoSQL?
• What are the characteristics of NoSQL databases?
• Types of NoSQL databases
• CAP theorem / BASE
Next lecture
• Databases of the future (non-examinable research avenues)