程序代写代做代考 Java flex go data structure database graph distributed system Haskell javascript 7CCSMBDT – Big Data Technologies Week 5

7CCSMBDT – Big Data Technologies Week 5
Grigorios Loukides, PhD (grigorios.loukides@kcl.ac.uk)
Chapter 5, Erl’s book (Big Data Fundamentals) and Chapter 4, Bagha’s book
Spring 2017/2018
1

NoSQL
What are the characteristics of Not only SQL (NoSQL) databases?
 Non-relational databases
 Highly-scalable and fault-tolerant
 Designed for large, distributed, semi-structured and unstructured data (usually no fixed schema nor joins)
 Mostly for queries, few asynchronous inserts & updates
 Accessible through API-based query interface and data- specific query languages
2

NoSQL
What led to Not only SQL (NoSQL) databases?
 Data got bigger and unstructured
 More users, across the globe, need 24/7 access
 Many apps built quickly
 Interest from companies for Software-as-a-Service (SaaS) and cloud-based solutions
3

Reminder of ACID properties
 Atomicity: all operations succeed or fail (no partial transactions)
 Consistency: all added data needs to conform to constraints
 Isolation: concurrent and sequential execution of transactions yield the same result
 Durability: a committed transaction cannot be rolled back
4

NoSQL
 NoSQL databases relax one or more of the ACID properties  Why?
 Many nodes contain replicas of partitions of the data and the CAP theorem holds
 Consistency:
 After an update is performed, all clients can read it
 Availability
 All clients can read and write, even after some nodes fail
 Partition tolerance
 System remains operational when network is partitioned
(e.g., due to malfunction that prevents nodes from communicating)
5

NoSQL
 NoSQL databases relax one or more of the ACID properties  Why? – CAP theorem
Consistency, Availability, Partition tolerance
• In any distributed system, you have to partition. That leaves either consistency or availability to choose from
• Often, you choose availability over consistency
• Video explaining CAP theorem

6

NoSQL
 NoSQL databases relax one or more of the ACID properties  How? – BASE:
 Basically Available: will always acknowledge a client’s request (answer or success/failure notification)
 Soft state: may be inconsistent when data are read
 Eventually consistent: reads after write may not return consistent
results, but they will once changes propagated to all nodes
Characteristics of NoSQL databases
Availability favored over consistency Approximate answers OK
Simpler and faster
7

NoSQL and 3Vs of Big Data
 Volume
 NoSQL databases allow scaling out (adding more nodes to a
commodity server)
 Velocity
 Fast writes using schema-on-read (data are applied to a schema as
they go out of the database)
 Low write latency (adding nodes decreases latency)
 Variety
 Can store semi-structured and unstructured data, schema is loose or non-existent
8

RDBMS vs. NoSQL
Elastic Scaling
•RDBMS scale up –bigger load , bigger server
•NoSQL scale out –distribute data across multiple hosts seamlessly
Big Data
•Huge increase in data RDBMS: capacity and constraints of data volumes at its limits
•NoSQL designed for big data (previous slide)
DBA Specialists
•RDBMS require highly trained expert to monitor DB
•NoSQL require less management, automatic repair and simpler
data models
9

RDBMS vs. NoSQL
Flexible data models
• RDBMS need careful schema change management • NoSQL databases do not need complicated schema management
Applications written to address an amorphous schema
Economic Cost
•RDBMS rely on expensive proprietary servers to manage data •No SQL: clusters of cheap commodity servers to manage the data and transaction volumes
•Cost per gigabyte or transaction/second for NoSQL can be lower than the cost for a RDBMS
10

RDBMS vs. NoSQL
Support
• RDBMS vendors provide a high level of support to clients
• NoSQL –are open source projects with startups supporting them
Maturity
• RDBMS stable and dependable
• NoSQL are still implementing their basic feature set
11

RDBMS vs. NoSQL
Lack of Expertise
• Whole workforce of trained and seasoned RDMS developers
• Still recruiting developers to the NoSQL camp
Analytics and Business Intelligence
• RDMS designed to address this niche
• NoSQL designed to meet the needs of a Web 2.0
application – not designed for ad hoc data queries
12

Types of NoSQL databases
 Key/value: “Hashtable” of keys
 Document: Stores documents comprised of tagged elements
 Column-family: Each storage block contains data from one column
 Graph: Stores graph-structured data (nodes and edges)
List Of NoSQL databases [currently >225] http://nosql-database.org/ Including many other types of databases
13

Key/value databases
 Store key-value pairs
 Keys are unique
 The values are only retrieved via keys and are opaque to the database
Key
Value
111
John Smth
324
Binary representation of image
567
XML file
 Key-value pairs are organized into collections (a.k.a buckets)
 Data are partitioned across nodes by keys
 The partition for a key is determined by hashing the key
14

Key/value databases
 Pros: very fast
simple model
able to scale horizontally
 Cons:
many data structures (objects) can’t be easily modeled as key value pairs
Provide examples of applications for which a key/value database is appropriate.
15

Key/value databases
 Suitable when:
 Unstructured data
 Fast read/writes
 Key suffices for identifying value
 No dependences among values
 Simple insert/delete/select operations
 Unsuitable when:
 Operations (search, filter, update) on individual
attributes of the value
 Operations on multiple keys in a single transaction
16

Document databases
 Store documents in a semi-structured form
 A document is a nested structure in JSON or XML format
JSON
{
name : “Joe Smith”, title : “Mr”, Address : {
address1 : “Dept. of Informatics”, address2 : “Strand”,
postcode : “WC2 1ER”,
}
expertise: [ “MongoDB”, “Python”, “Javascript” ], employee_number : 320,
location : [ 53.34, -6.26 ]
}
Unordered name-value pairs
Ordered collection of values
17

Document databases
 Store documents in a semi-structured form
 A document is a nested structure in JSON or XML format
XML

Joe Smith Mr

Dept. of Informatics


http://www.w3schools.com/xml/
18

Document vs. key-value databases
1. In document databases, each document has a unique key
2. Document databases provide more support for value operations
 They are aware of values
 A select operation can retrieve fields or parts of a value
 e.g., regular expression on a value
 A subset of values can be updated together
 Indexes are supported
 Each document has a schema that can be inferred from the structure of the value
 e.g., store two employees one of which does not have address
19

Document databases
 Suitable when:
 Semi-structured data with flat or nested schema
 Search on different values of document
 Updates on subsets of values
 CRUD operations (Create, Read, Update, Delete)  Schema changes are likely
 Unsuitable when:
 Binary data
 Updates on multiple documents in a single transaction  Joins between multiple documents
20

Column-family databases
 They store columns. Each column has a name and value.  Column-family databases a.k.a. column-stores
 Group related columns in a row
 A row does not necessarily have fixed schema or number of columns

Column-family databases
ID
Name
Salary
1
Joe D
$24000
2
Peter J
$28000
3
Phil G
$23000
Difference in storage layout
1
Joe D
$24000
2
Peter J
$28000
3
Phil G
$23000
RDBMS
Column Database
1
2
3
Joe D
Peter J
Phil G
$24000
$28000
$23000

Column-family databases
 The storage layout difference matters!
suitable for read-mostly, read-intensive, large data repositories
Slide based on Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf
Query time in seconds

Column-family databases
Why are column-family databases faster for certain queries?
1. They are efficient for reads when queries need only certain columns
Row store
Column store
2. Columns have similar values
 They compress better to save space
 They can be sorted for efficient range queries through indexing
Slide based on Stavros Harizopoulos, Daniel Abadi, Peter Boncz (2009)
http://www.cs.yale.edu/homes/dna/talks/Column_Store_Tutorial_VLDB09.pdf

Column-family databases
 Suitable when:
 Data has tabular structure with many columns and sparsely
populated rows (i.e., high-dimensional matrix with many 0s)
 Columns are interrelated and accessed together often
 OLAP
 Realtime random read-write is needed
 Insert/select/update/delete operations
 Unsuitable when:  Joins
 ACID support is needed
 Binary data
 SQL-compliant queries
 Frequently changing query patterns that lead to column
restructuring
25

Column-family databases
 Examples of applications:
http://ieeexplore.ieee.org/document/6461866/?reload=true
26

Graph databases
 Store graph-structured data
 Nodes represent the entities and have a set of attributes
 Edges represent the relationships and have a set of attributes
27

Graph databases
 Optimized for representing connections
 Can add/remove edges and their attributes easily
Bob becomes friend of Grace
28

Graph databases
 Designed to allow efficient queries for finding interconnected nodes based on node and/or edge attributes
Find all friends of each node
Scalable w.r.t. nodes because each node knows its neighbors
Find all friends of friends of Jim who are not his colleagues
29

Graph databases
 Their underlying storage can be native graph storage, or a relational database, a key/value database, a document database, etc.
 E.g., Titangraph is a graph database built on Apache Cassandra (modeled as a 2D key-value store*)
* https://en.wikipedia.org/wiki/Wide_column_store
30

Graph databases
 Suitable when:
 Data comprised of interconnected entities  Queries are based on entity relationships
 Finding groups of interconnected entities  Finding distances between entities
 Unsuitable when:  Joins
 ACID support is needed
 Binary data
 SQL-compliant queries
 Frequently changing query patterns that lead to column
restructuring
31

Graph databases
 Applications  Social
Organizations to gain competitive and operational advantage
by leveraging information about the connections between people.
Collaboration, manage information, predict behaviour
 Recommendation
identify resources of interest to a particular individual or group, or individuals and groups likely to have some interest in a particular resource
Recommendation based on what your friends like
 Geo
shortest routes between locations, nearest POIs, intersection between regions
32

MongoDB
Introduction and CRUD operations
Important features
 Replication
 Sharding
 Index support
 Fast in-place updates
 MapReduce functionality
https://docs.mongodb.com/
https://www.mongodb.com/ 33

MongoDB
 A document database
 Hash-based
 Stores hashes (system-assigned _id) with keys and values
for each document
 Dynamic schema
 No Data Definition Language
 Application tracks the schema and mapping
 Uses BSON (Binary JSON) format
 APIs for many languages
 JavaScript, Python, Ruby, Perl, Java,
Scala, C#, C++, Haskell, …
https://www.mongodb.com/ 34

MongoDB: concepts
https://www.mongodb.com/ 35

MongoDB: CRUD
Create
db.products.insert( { _id:10, item: “box”, qty: 20 } ) The document has an _id (system-assigned if unspecified),
a field item with value “box”, and a field qty with value 20.
products
db.products.find( { qty: { $gt: 4 } } ) Documents with qty>4 are returned
(next slide)
Read
Update Delete
db.products.remove( {“item”:”box”} )
Removes all documents with item=box (if multiple)
36

MongoDB: CRUD
books
books after db.books.update
37

Replication
 Multiple replicas (dataset copies) are stored
 Provides scalability (distributed operations), availability (due to
redundancy), and fault tolerance (automatic failover)
 Replica set: group of <=50 mongod instances that contain the same copy of the dataset  Primary instance: receives operation requests  Secondary instances: apply the operations to their data  Automatic failover When a primary does not communicate with its secondaries for >10 seconds, a secondary becomes primary after elections and voting
https://docs.mongodb.com/v3.4/replication/ 38

Sharding: overview
 Sharding is the process of horizontally partitioning the dataset into parts (shards) that are distributed across multiple nodes
 Each node is responsible only for its shard
Shared cluster for running mongodb
Interface between applications and sharded cluster.
Processes all requests
Decides how the query is distributed based on metadata from config server
Stores metadata and configuration settings for the cluster
https://docs.mongodb.com/v3.4/sharding/ 39
To benefit from replication, shards and config server may be implemented as replica sets

Sharding: benefits
 Efficient reads and writes: They are distributed across the shards.
 More shards, higher efficiency
 Storage capacity: Each shard has a part of the dataset
 More shards, increased capacity
 High availability: Partial read / write operations performed if shards are unavailable. Reads or writes directed at the available shards can still succeed.
 More shards, less chance of a shard to be unavailable
https://docs.mongodb.com/v3.4/sharding/
40

Index support
 B+ tree indices, GeoSpatial indices, text indices
 Index created by the system on _id. Can be viewed by
db.system.indexes.find()
 Users can create other indices db.phones.ensureIndex(
{ display : 1 },
{ unique : true, dropDups : true } )
and compound indices
db.products.createIndex( { “item”: 1, “stock”: 1 } ) https://docs.mongodb.com/v3.4/indexes/ 41
Only unique values
in “display” field duplicates are dropped

Index support
 Get indexes on collectionA db.collectionA.getIndexes()
 Drops all indexes
(other than the required one on _id)
db.collectionA.dropIndexes()
 Drops the index with name “catIdx” db.collectionA.dropIndex(“catIdx”)
https://docs.mongodb.com/v3.4/indexes/ 42
[
{
} ]
“v” : 1,
“key” : { “cat” : -1 }, “ns” : “test.pets”, “name” : “catIdx”

Fast in-place updates
 In-place update: Modifies a document without growing its size
Increases stock by 5 and changes item name
{
_id: 1,
item: “book1” stock: 0,
}
db.books.update( { _id: 1 },
{
$inc: { stock: 5 }, $set: {
item: “ABC123“} }
)
 In-place updates are fast because the database does not have to allocate and write a full new copy of the object.
 Multiple updates are written one time (every few seconds). This lazy writing strategy helps efficiency, when updates are frequent (e.g., 1000 increments to stock)
43

MapReduce functionality
“maps” a value with a key and emits the key and value pair
“reduces” to a single object all the values associated with a particular key.
Query to select input documents to the map function. Location of the result (collection or in-line)
https://docs.mongodb.com/manual/reference/method/db.collection.mapReduce/#db.collection.mapReduce
44

MapReduce functionality
var mapF=function(){ emit(this.cust_id, this.amount); }; this refers to the document that
the map-reduce operation is processing
Map applied to docs with status: “A”
https://docs.mongodb.com/manual/reference/method/db.collection.mapReduce/#db.collection.mapReduce
45

MapReduce functionality
var redF=function(key, values){ return Array.sum(values); };
Reduce reads all values with same key
https://docs.mongodb.com/manual/reference/method/db.collection.mapReduce/#db.collection.mapReduce
46

MapReduce functionality
 Executes the MapReduce operation and outputs to the collection “order_totals” db.orders.mapReduce(mapF, redF, {out: “order_totals”})
https://docs.mongodb.com/manual/reference/method/db.collection.mapReduce/#db.collection.mapReduce
47

EXTRA SLIDES (for reference)
48

Question
 An application requires removing middle name, for each value that corresponds to a person’s name.
Key
Value
111
John K. Smith
324
Sarah R. Lee
953
….


 Is a key/value NoSQL database suitable for supporting the application or not? Justify your answer
49

Example of geospatial index
 Create a 2dsphere Index on “loc” field db.collectionA.createIndex( { loc: “2dsphere” } )
Insert some data
Places within [1000m,5000m] from Central Park
db.places.insert( {
loc : { type: “Point”,
coordinates: [ -73.97, 40.77 ] },
name: “Central Park”,
category : “Parks” })
https://docs.mongodb.com/v3.4/indexes/
50
db.places.find( {
location: { $near :
{
$geometry: { type: “Point”,
coordinates: [ -73.9667, 40.78 ] }, $minDistance: 1000, $maxDistance: 5000
} }
} )