7CCSMBDT – Big Data Technologies Week 5
Grigorios Loukides, PhD (grigorios.loukides@kcl.ac.uk)
Chapter 5, Erl’s book (Big Data Fundamentals) and Chapter 4, Bahga’s book
Spring 2017/2018
NoSQL
What are the characteristics of Not only SQL (NoSQL) databases?
Non-relational databases
Highly-scalable and fault-tolerant
Designed for large, distributed, semi-structured and unstructured data (usually no fixed schema nor joins)
Mostly for queries, few asynchronous inserts & updates
Accessible through API-based query interfaces and data-specific query languages
NoSQL
What led to Not only SQL (NoSQL) databases?
Data got bigger and unstructured
More users, across the globe, need 24/7 access
Many apps built quickly
Interest from companies for Software-as-a-Service (SaaS) and cloud-based solutions
Reminder of ACID properties
Atomicity: all operations of a transaction succeed, or all of them fail (no partial transactions)
Consistency: all added data must conform to the constraints
Isolation: concurrent execution of transactions yields the same result as sequential execution
Durability: the changes of a committed transaction are permanent (they cannot be rolled back)
NoSQL
NoSQL databases relax one or more of the ACID properties. Why?
Many nodes hold replicas of partitions of the data, so the CAP theorem applies
Consistency
After an update is performed, all clients can read it
Availability
All clients can read and write, even after some nodes fail
Partition tolerance
System remains operational when network is partitioned
(e.g., due to malfunction that prevents nodes from communicating)
NoSQL
NoSQL databases relax one or more of the ACID properties. Why? – CAP theorem
Consistency, Availability, Partition tolerance
• In any distributed system, network partitions cannot be ruled out, so partition tolerance is required. That leaves a choice between consistency and availability
• Often, you choose availability over consistency
• Video explaining CAP theorem
NoSQL
NoSQL databases relax one or more of the ACID properties. How? – BASE:
Basically Available: will always acknowledge a client’s request (answer or success/failure notification)
Soft state: data may be inconsistent when they are read
Eventually consistent: reads after a write may not return consistent results, but they will once the changes have propagated to all nodes
Characteristics of NoSQL databases
Availability favored over consistency
Approximate answers OK
Simpler and faster
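The BASE behaviour described above can be illustrated with a toy simulation. This is only a sketch: the classes `Replica` and `EventuallyConsistentStore` and the `propagate` step are hypothetical names for illustration, not part of any real database. A write is acknowledged at once by one replica, a read from another replica is stale until the change has propagated.

```javascript
// Toy model of eventual consistency (all names are hypothetical).
class Replica {
  constructor(name) {
    this.name = name;
    this.data = new Map();
  }
  read(key) {
    return this.data.get(key);
  }
}

class EventuallyConsistentStore {
  constructor(replicaNames) {
    this.replicas = replicaNames.map((n) => new Replica(n));
    this.pending = []; // updates not yet copied to every replica
  }
  // Basically available: the write is applied to one replica and acknowledged.
  write(key, value) {
    this.replicas[0].data.set(key, value);
    this.pending.push({ key, value });
    return "ack";
  }
  // Soft state: a read may be served by a replica that is still behind.
  readFrom(index, key) {
    return this.replicas[index].read(key);
  }
  // Eventually consistent: pending updates are applied, in order, everywhere.
  propagate() {
    for (const { key, value } of this.pending) {
      for (const replica of this.replicas) {
        replica.data.set(key, value);
      }
    }
    this.pending = [];
  }
}

const store = new EventuallyConsistentStore(["A", "B", "C"]);
store.write("stock", 5);
const staleRead = store.readFrom(1, "stock"); // replica B has not seen the write yet
store.propagate();
const freshRead = store.readFrom(1, "stock"); // replica B has caught up
console.log(staleRead, freshRead);
```

Real systems propagate changes asynchronously over the network; the explicit `propagate()` call only makes the window of inconsistency visible.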
NoSQL and 3Vs of Big Data
Volume
NoSQL databases allow scaling out (adding more commodity-server nodes to the cluster)
Velocity
Fast writes using schema-on-read (a schema is applied to the data as they are read out of the database)
Low write latency (adding nodes decreases latency)
Variety
Can store semi-structured and unstructured data, schema is loose or non-existent
RDBMS vs. NoSQL
Elastic Scaling
• RDBMS scale up – bigger load, bigger server
• NoSQL scale out – distribute data across multiple hosts seamlessly
Big Data
• Huge increase in data
• RDBMS: capacity and constraints of data volumes at their limits
• NoSQL: designed for big data (previous slide)
DBA Specialists
• RDBMS require highly trained experts to monitor the DB
• NoSQL require less management, with automatic repair and simpler data models
RDBMS vs. NoSQL
Flexible data models
• RDBMS need careful schema change management
• NoSQL databases do not need complicated schema management
Applications written to address an amorphous schema
Economic Cost
• RDBMS rely on expensive proprietary servers to manage data
• NoSQL: clusters of cheap commodity servers manage the data and transaction volumes
• Cost per gigabyte or transaction/second for NoSQL can be lower than the cost for an RDBMS
RDBMS vs. NoSQL
Support
• RDBMS vendors provide a high level of support to clients
• NoSQL databases are open source projects with startups supporting them
Maturity
• RDBMS stable and dependable
• NoSQL are still implementing their basic feature set
RDBMS vs. NoSQL
Lack of Expertise
• Whole workforce of trained and seasoned RDBMS developers
• Still recruiting developers to the NoSQL camp
Analytics and Business Intelligence
• RDBMS designed to address this niche
• NoSQL designed to meet the needs of a Web 2.0 application – not designed for ad hoc data queries
Types of NoSQL databases
Key/value: “Hashtable” of keys
Document: Stores documents comprised of tagged elements
Column-family: Each storage block contains data from one column
Graph: Stores graph-structured data (nodes and edges)
List of NoSQL databases [currently >225]: http://nosql-database.org/
Including many other types of databases
Key/value databases
Store key-value pairs
Keys are unique
The values are only retrieved via keys and are opaque to the database
Key   Value
111   John Smith
324   Binary representation of image
567   XML file
Key-value pairs are organized into collections (a.k.a. buckets)
Data are partitioned across nodes by keys
The partition for a key is determined by hashing the key
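Hash-based partitioning of keys across nodes can be sketched as follows. This is a minimal sketch under stated assumptions: the hash function `hashKey` (a djb2-style string hash), the helper names `partitionFor`, `put` and `get`, and the three-node setup are hypothetical. Real key/value stores use their own hash functions and partitioning schemes.

```javascript
// Hypothetical string hash (djb2-style), kept as an unsigned 32-bit integer.
function hashKey(key) {
  let h = 5381;
  for (const ch of String(key)) {
    h = (h * 33 + ch.codePointAt(0)) >>> 0;
  }
  return h;
}

// The partition (node index) for a key is the hash modulo the node count.
function partitionFor(key, numNodes) {
  return hashKey(key) % numNodes;
}

// Each node is modelled as an in-memory Map holding its share of the pairs.
const NUM_NODES = 3;
const nodes = Array.from({ length: NUM_NODES }, () => new Map());

function put(key, value) {
  nodes[partitionFor(key, NUM_NODES)].set(key, value);
}

function get(key) {
  // The same hash sends the lookup to the node that stored the key.
  return nodes[partitionFor(key, NUM_NODES)].get(key);
}

put("111", "John Smith");
put("324", "<binary representation of image>");
put("567", "<XML file>");
console.log(get("111"), nodes.map((n) => n.size));
```

With plain modulo hashing, changing the number of nodes moves most keys to a different node, which is one reason real systems often prefer other schemes.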
Key/value databases
Pros: very fast
simple model
able to scale horizontally
Cons:
many data structures (objects) can’t be easily modeled as key value pairs
Provide examples of applications for which a key/value database is appropriate.
Key/value databases
Suitable when:
Unstructured data
Fast read/writes
Key suffices for identifying value
No dependences among values
Simple insert/delete/select operations
Unsuitable when:
Operations (search, filter, update) on individual attributes of the value
Operations on multiple keys in a single transaction
Document databases
Store documents in a semi-structured form
A document is a nested structure in JSON or XML format
JSON
{
  name: "Joe Smith",
  title: "Mr",
  Address: {
    address1: "Dept. of Informatics",
    address2: "Strand",
    postcode: "WC2 1ER"
  },
  expertise: [ "MongoDB", "Python", "Javascript" ],
  employee_number: 320,
  location: [ 53.34, -6.26 ]
}
An object ({ ... }) is a set of unordered name-value pairs
An array ([ ... ]) is an ordered collection of values
Document databases
XML
…
http://www.w3schools.com/xml/
Document vs. key-value databases
1. In document databases, each document has a unique key
2. Document databases provide more support for value operations
They are aware of values
A select operation can retrieve fields or parts of a value
e.g., regular expression on a value
A subset of values can be updated together
Indexes are supported
Each document has a schema that can be inferred from the structure of the value
e.g., two employee documents can be stored in the same collection even if one of them has no address
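The value-aware operations listed above can be mimicked on plain JavaScript objects. This is only an illustration: the second employee ("Ann Lee") and the variable names are hypothetical sample data, and array methods stand in for what a document database would do with its own query language and indexes.

```javascript
// Two documents in the same collection with different structure:
// the second (hypothetical) employee has no address field.
const employees = [
  {
    _id: 1,
    name: "Joe Smith",
    address: { address1: "Dept. of Informatics", postcode: "WC2 1ER" },
    expertise: ["MongoDB", "Python", "Javascript"],
  },
  { _id: 2, name: "Ann Lee", expertise: ["Java"] },
];

// Select by a regular expression on one field and return only part of the value.
const smiths = employees
  .filter((doc) => /Smith$/.test(doc.name))
  .map((doc) => ({ _id: doc._id, name: doc.name }));

// The structure of each document can be inspected at query time.
const idsWithoutAddress = employees
  .filter((doc) => !("address" in doc))
  .map((doc) => doc._id);

// Update a subset of the values of the matching documents only.
for (const doc of employees) {
  if (doc.expertise.includes("Java")) {
    doc.title = "Ms";
  }
}

console.log(smiths, idsWithoutAddress, employees[1].title);
```

A key/value database could not run these operations itself, because the values would be opaque to it.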
Document databases
Suitable when:
Semi-structured data with flat or nested schema
Search on different values of document
Updates on subsets of values
CRUD operations (Create, Read, Update, Delete)
Schema changes are likely
Unsuitable when:
Binary data
Updates on multiple documents in a single transaction
Joins between multiple documents
Column-family databases
They store columns. Each column has a name and value.
Column-family databases a.k.a. column-stores
Group related columns in a row
A row does not necessarily have fixed schema or number of columns
Column-family databases
ID   Name      Salary
1    Joe D     $24000
2    Peter J   $28000
3    Phil G    $23000
Difference in storage layout
RDBMS (values stored row by row):
1, Joe D, $24000 | 2, Peter J, $28000 | 3, Phil G, $23000
Column Database (values stored column by column):
1, 2, 3 | Joe D, Peter J, Phil G | $24000, $28000, $23000
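The two layouts of the table above can be compared with a small sketch. The variable names and the "values read" counters are assumptions made for illustration: they only model the idea that a row layout reads whole records while a column layout reads one column, not any real storage engine.

```javascript
const rows = [
  { id: 1, name: "Joe D", salary: 24000 },
  { id: 2, name: "Peter J", salary: 28000 },
  { id: 3, name: "Phil G", salary: 23000 },
];

// Row layout: the values of one record are adjacent.
const rowStore = rows.flatMap((r) => [r.id, r.name, r.salary]);

// Column layout: the values of one column are adjacent.
const columnStore = {
  id: rows.map((r) => r.id),
  name: rows.map((r) => r.name),
  salary: rows.map((r) => r.salary),
};

// Query: total salary. The row layout steps through every record in full.
const FIELDS_PER_ROW = 3;
let rowValuesRead = 0;
let rowTotal = 0;
for (let i = 0; i < rowStore.length; i += FIELDS_PER_ROW) {
  rowValuesRead += FIELDS_PER_ROW; // the whole record is read
  rowTotal += rowStore[i + 2]; // salary is the third field
}

// The column layout scans only the salary column.
let columnValuesRead = 0;
let columnTotal = 0;
for (const salary of columnStore.salary) {
  columnValuesRead += 1;
  columnTotal += salary;
}

console.log(rowTotal, rowValuesRead, columnTotal, columnValuesRead);
```

Both layouts give the same answer, but the column layout touches a third of the values for this query. The gap grows with the number of columns that the query does not need.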
Column-family databases
The storage layout difference matters!
suitable for read-mostly, read-intensive, large data repositories
Slide based on Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
[Figure: query time in seconds]
Column-family databases
Why are column-family databases faster for certain queries?
1. They are efficient for reads when queries need only certain columns
[Figure: row store vs. column store]
2. Columns have similar values
They compress better to save space
They can be sorted for efficient range queries through indexing
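Why a column with similar values compresses well can be shown with run-length encoding. This is an assumption for illustration: the slides do not name a compression scheme, and run-length encoding is just one simple option. The column `departmentColumn` and the function `runLengthEncode` are hypothetical.

```javascript
// Run-length encoding: consecutive equal values are stored once with a count.
function runLengthEncode(values) {
  const runs = [];
  for (const v of values) {
    const last = runs[runs.length - 1];
    if (last !== undefined && last.value === v) {
      last.count += 1;
    } else {
      runs.push({ value: v, count: 1 });
    }
  }
  return runs;
}

// Inverse operation, to check that no information is lost.
function runLengthDecode(runs) {
  return runs.flatMap((run) => Array(run.count).fill(run.value));
}

// A hypothetical sorted column: equal values end up next to each other.
const departmentColumn = ["HR", "HR", "HR", "IT", "IT", "Sales", "Sales", "Sales", "Sales"];
const encoded = runLengthEncode(departmentColumn);
console.log(encoded);
```

Nine stored values shrink to three runs here. In a row layout the same values would be interleaved with names and salaries, so such runs would not form.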
Column-family databases
Suitable when:
Data has tabular structure with many columns and sparsely populated rows (i.e., high-dimensional matrix with many 0s)
Columns are interrelated and accessed together often
OLAP
Realtime random read-write is needed
Insert/select/update/delete operations
Unsuitable when:
Joins
ACID support is needed
Binary data
SQL-compliant queries
Frequently changing query patterns that lead to column restructuring
Column-family databases
Examples of applications:
http://ieeexplore.ieee.org/document/6461866/?reload=true
Graph databases
Store graph-structured data
Nodes represent the entities and have a set of attributes
Edges represent the relationships and have a set of attributes
Graph databases
Optimized for representing connections
Can add/remove edges and their attributes easily
Bob becomes friend of Grace
Graph databases
Designed to allow efficient queries for finding interconnected nodes based on node and/or edge attributes
Find all friends of each node
Scalable w.r.t. nodes because each node knows its neighbors
Find all friends of friends of Jim who are not his colleagues
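The friends-of-friends query above can be sketched on an adjacency list, where each node keeps its own neighbours. This is a minimal sketch: the edge data (including the person "Ann") and the helper names `addEdge`, `neighbors` and `friendsOfFriendsNotColleagues` are hypothetical, and a graph database would express the query in its own language.

```javascript
// Adjacency list: each node maps to its outgoing edges (with a type attribute).
const adjacency = new Map();

// Relationships are treated as undirected, so the edge is stored both ways.
function addEdge(a, b, type) {
  for (const [from, to] of [[a, b], [b, a]]) {
    if (!adjacency.has(from)) adjacency.set(from, []);
    adjacency.get(from).push({ node: to, type });
  }
}

// Neighbours of a node reached through edges of the given type.
function neighbors(node, type) {
  return (adjacency.get(node) || [])
    .filter((edge) => edge.type === type)
    .map((edge) => edge.node);
}

// Friends of friends of a person, excluding the person and the person's colleagues.
function friendsOfFriendsNotColleagues(person) {
  const colleagues = new Set(neighbors(person, "colleague"));
  const result = new Set();
  for (const friend of neighbors(person, "friend")) {
    for (const candidate of neighbors(friend, "friend")) {
      if (candidate !== person && !colleagues.has(candidate)) {
        result.add(candidate);
      }
    }
  }
  return [...result].sort();
}

addEdge("Jim", "Bob", "friend");
addEdge("Bob", "Grace", "friend"); // Bob becomes friend of Grace
addEdge("Bob", "Ann", "friend");
addEdge("Jim", "Ann", "colleague");

console.log(friendsOfFriendsNotColleagues("Jim"));
```

The query only visits the neighbours of Jim and of his friends. Its cost depends on the size of that neighbourhood, not on the total number of nodes.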
Graph databases
Their underlying storage can be native graph storage, or a relational database, a key/value database, a document database, etc.
E.g., Titangraph is a graph database built on Apache Cassandra (modeled as a 2D key-value store*)
* https://en.wikipedia.org/wiki/Wide_column_store
Graph databases
Suitable when:
Data comprised of interconnected entities
Queries are based on entity relationships
Finding groups of interconnected entities
Finding distances between entities
Unsuitable when:
Joins
ACID support is needed
Binary data
SQL-compliant queries
Frequently changing query patterns that lead to column restructuring
Graph databases
Applications
Social
Organizations gain competitive and operational advantage by leveraging information about the connections between people.
Collaboration, manage information, predict behaviour
Recommendation
Identify resources of interest to a particular individual or group, or individuals and groups likely to have some interest in a particular resource
Recommendation based on what your friends like
Geo
Shortest routes between locations, nearest POIs, intersection between regions
MongoDB
Introduction and CRUD operations
Important features
Replication
Sharding
Index support
Fast in-place updates
MapReduce functionality
https://docs.mongodb.com/
https://www.mongodb.com/
MongoDB
A document database
Hash-based
Stores hashes (system-assigned _id) with keys and values for each document
Dynamic schema
No Data Definition Language
Application tracks the schema and mapping
Uses BSON (Binary JSON) format
APIs for many languages
JavaScript, Python, Ruby, Perl, Java, Scala, C#, C++, Haskell, …
https://www.mongodb.com/
MongoDB: concepts
https://www.mongodb.com/
MongoDB: CRUD
Create
db.products.insert( { _id: 10, item: "box", qty: 20 } )
The document inserted into the products collection has an _id (system-assigned if unspecified), a field item with value "box", and a field qty with value 20.
Read
db.products.find( { qty: { $gt: 4 } } )
Documents with qty > 4 are returned
Update
(next slide)
Delete
db.products.remove( { "item": "box" } )
Removes all documents with item = "box" (if multiple)
MongoDB: CRUD
[Figure: the books collection before and after db.books.update]
Replication
Multiple replicas (dataset copies) are stored
Provides scalability (distributed operations), availability (due to redundancy), and fault tolerance (automatic failover)
Replica set: a group of at most 50 mongod instances that maintain the same dataset
Primary instance: receives operation requests
Secondary instances: apply the operations to their data
Automatic failover
When a primary does not communicate with its secondaries for >10 seconds, a secondary becomes primary after elections and voting
https://docs.mongodb.com/v3.4/replication/
Sharding: overview
Sharding is the process of horizontally partitioning the dataset into parts (shards) that are distributed across multiple nodes
Each node is responsible only for its shard
Sharded cluster for running MongoDB
Query router (mongos): interface between applications and the sharded cluster. Processes all requests and decides how each query is distributed based on metadata from the config server
Config server: stores metadata and configuration settings for the cluster
To benefit from replication, shards and config servers may be implemented as replica sets
https://docs.mongodb.com/v3.4/sharding/
Sharding: benefits
Efficient reads and writes: They are distributed across the shards.
More shards, higher efficiency
Storage capacity: Each shard has a part of the dataset
More shards, increased capacity
High availability: Partial read / write operations performed if shards are unavailable. Reads or writes directed at the available shards can still succeed.
More shards, smaller part of the dataset affected when a shard is unavailable
https://docs.mongodb.com/v3.4/sharding/
Index support
B+ tree indices, GeoSpatial indices, text indices
Index created by the system on _id. Can be viewed by
db.system.indexes.find()
Users can create other indices
db.phones.ensureIndex( { display : 1 }, { unique : true, dropDups : true } )
Only unique values in the "display" field; duplicates are dropped
and compound indices
db.products.createIndex( { "item": 1, "stock": 1 } )
https://docs.mongodb.com/v3.4/indexes/
Index support
Get indexes on collectionA
db.collectionA.getIndexes()
Example output (for an index on the collection test.pets):
[
  {
    "v" : 1,
    "key" : { "cat" : -1 },
    "ns" : "test.pets",
    "name" : "catIdx"
  }
]
Drops all indexes (other than the required one on _id)
db.collectionA.dropIndexes()
Drops the index with name "catIdx"
db.collectionA.dropIndex("catIdx")
https://docs.mongodb.com/v3.4/indexes/
Fast in-place updates
In-place update: Modifies a document without growing its size
Increases stock by 5 and changes item name
{
  _id: 1,
  item: "book1",
  stock: 0
}
db.books.update(
  { _id: 1 },
  { $inc: { stock: 5 }, $set: { item: "ABC123" } }
)
In-place updates are fast because the database does not have to allocate and write a full new copy of the object.
Multiple updates are written to disk once (every few seconds). This lazy writing strategy improves efficiency when updates are frequent (e.g., 1000 increments to stock)
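The effect of the $inc and $set operators, and of lazy writing, can be imitated on a plain JavaScript object. This is a toy sketch only: `applyUpdate`, `flushToDisk` and the `diskWrites` counter are hypothetical helpers and do not show how MongoDB implements updates or persistence.

```javascript
// Hypothetical helper: applies $inc and $set to a document in place.
function applyUpdate(doc, update) {
  for (const [field, amount] of Object.entries(update.$inc || {})) {
    doc[field] = (doc[field] || 0) + amount; // increment the existing value
  }
  for (const [field, value] of Object.entries(update.$set || {})) {
    doc[field] = value; // overwrite the existing value
  }
  return doc; // the same object, not a new copy
}

// Lazy writing: many in-memory updates are followed by a single write.
let diskWrites = 0;
function flushToDisk(doc) {
  diskWrites += 1; // stands in for one write of the current document state
  return JSON.stringify(doc);
}

const book = { _id: 1, item: "book1", stock: 0 };
const updated = applyUpdate(book, { $inc: { stock: 5 }, $set: { item: "ABC123" } });

// 1000 further increments are applied in memory before one flush.
for (let i = 0; i < 1000; i += 1) {
  applyUpdate(book, { $inc: { stock: 1 } });
}
const persisted = flushToDisk(book);
console.log(updated === book, book.stock, diskWrites);
```

The document is modified where it is and written once, instead of being copied and written 1001 times.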
MapReduce functionality
Map: “maps” a value to a key and emits the key-value pair
Reduce: “reduces” to a single object all the values associated with a particular key
Query: selects the input documents to the map function
Out: location of the result (collection or in-line)
https://docs.mongodb.com/manual/reference/method/db.collection.mapReduce/#db.collection.mapReduce
MapReduce functionality
var mapF = function() { emit(this.cust_id, this.amount); };
this refers to the document that the map-reduce operation is processing
Map applied to docs with status: "A"
MapReduce functionality
var redF=function(key, values){ return Array.sum(values); };
Reduce reads all values with same key
MapReduce functionality
Executes the MapReduce operation and outputs to the collection "order_totals"
db.orders.mapReduce(mapF, redF, {out: "order_totals"})
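The data flow of the operation above can be traced without a MongoDB server. This is a plain JavaScript sketch under stated assumptions: the sample orders, the global `emit`, and the driver `runMapReduce` are hypothetical stand-ins for what the database does internally. The reduce function uses the standard `reduce` method because `Array.sum` is a MongoDB shell helper, not part of standard JavaScript.

```javascript
// Hypothetical sample of the orders collection.
const orders = [
  { cust_id: "A123", amount: 500, status: "A" },
  { cust_id: "A123", amount: 250, status: "A" },
  { cust_id: "B212", amount: 200, status: "A" },
  { cust_id: "A123", amount: 300, status: "D" },
];

// Stand-in for the emit() that the database provides to the map function.
const emitted = new Map();
function emit(key, value) {
  if (!emitted.has(key)) emitted.set(key, []);
  emitted.get(key).push(value);
}

// Map and reduce functions as on the slides.
const mapF = function () { emit(this.cust_id, this.amount); };
const redF = function (key, values) {
  return values.reduce((sum, v) => sum + v, 0);
};

// Hypothetical driver: filter by the query, map each document, reduce per key.
function runMapReduce(docs, query) {
  emitted.clear();
  docs
    .filter((doc) => Object.entries(query).every(([field, v]) => doc[field] === v))
    .forEach((doc) => mapF.call(doc)); // "this" is the current document
  const out = {};
  for (const [key, values] of emitted) {
    out[key] = redF(key, values);
  }
  return out;
}

const orderTotals = runMapReduce(orders, { status: "A" });
console.log(orderTotals);
```

Each key ends up with one reduced value: the total amount of that customer's orders with status "A".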
EXTRA SLIDES (for reference)
Question
An application requires removing middle name, for each value that corresponds to a person’s name.
Key   Value
111   John K. Smith
324   Sarah R. Lee
953   …
…
Is a key/value NoSQL database suitable for supporting the application or not? Justify your answer.
Example of geospatial index
Create a 2dsphere index on the "loc" field
db.places.createIndex( { loc: "2dsphere" } )
Insert some data
db.places.insert( {
  loc: { type: "Point", coordinates: [ -73.97, 40.77 ] },
  name: "Central Park",
  category: "Parks"
} )
Places within [1000m, 5000m] from Central Park
db.places.find( {
  loc: {
    $near: {
      $geometry: { type: "Point", coordinates: [ -73.9667, 40.78 ] },
      $minDistance: 1000,
      $maxDistance: 5000
    }
  }
} )
https://docs.mongodb.com/v3.4/indexes/