CS代考 Lecture 13-14 – Big Data and CouchDB

Lecture 13-14 – Big Data and CouchDB
Luca Morandini Cloud Architect – Melbourne eResearch Group University of Melbourne

Outline of this Lecture

Copyright By PowCoder代写 加微信 powcoder

Part 1: “Big data” challenges and architectures
● DBMSs for distributed environments
● What distributed DBMSs look like: MongoDB vs CouchDB
● Consistency and availability in distributed environments
● MapReduce algorithms
● Sharding
Part 2: Introduction to CouchDB
● Managing documents
● HTTP API
● Queries (Views, Mango and Full-text Search)
Part 3: Workshop on CouchDB
● Setting up a 3-node cluster with Docker
● Storing and retrieving data using CouchDB

Part 1: “Big data” Challenges and Architectures

“Big data” Is Not Just About “Bigness”
The four “Vs” :
● Volume: yes, volume (Giga, Tera, Peta, …) is a criteria, but
not the only one
● Velocity: the frequency at which new data is being brought into
the system and analytics performed
● Variety: the variability and complexity of data schema. The
more complex the data schema(s) you have, the higher the probability of them changing along the way, adding more complexity.
● Veracity: the level of trust in data accuracy (provenance); the more diverse sources you have, the more unstructured they are, the less veracity you have.

Big Data Calls for Ad-hoc Solutions
● While Relational DBMSs are extremely good at ensuring consistency, they rely on normalized data models that, in a world of big data (think about Veracity and Variety) can no longer be taken for granted.
● Therefore, it makes sense to use DBMSs that are built upon data models that are not relational (relational model: tables, columns and relationships amongst tables -that is, relational algebra).
● While there is nothing preventing SQL to be used in distributed environments, alternative query languages have been used for distributed DBMSs, hence they are sometimes called NoSQL

DBMSs for Distributed Environments
● A key-value store is a DBMS that allows the retrieval of a chunk of data given a key: fast, but crude (e.g. Redis, RocksDB, Berkeley DB)
●A BigTable DBMS stores data in columns grouped into column families, with rows potentially containing different columns of the same family (e.g. Apache Cassandra, Apache Accumulo)
● A Document-oriented DBMS stores data as structured documents, usually expressed as XML or JSON (e.g. Apache CouchDB, MongoDB)

A Tale of Two Clusters
● Distributed databases are run over “clusters”, that is, sets of connected computers
● Clusters are needed to:
○ Distribute the computing load over multiple computers, e.g.
to improve availability
○ Storing multiple copies of data, e.g. to achieve redundancy
● Consider two document-oriented DBMSs (CouchDB and MongoDB) and their typical cluster architectures

CouchDB Cluster Architecture
● All nodes answer requests (read or write) at the same time
● Sharding (splitting of data across nodes) is done on every node
● When a node does not contain a document (say, a document of Shard A is requested to Node 2), the node requests it from another node (say, Node 1) and returns it to the client
● Nodes can be added/removed easily, and their shards are re-balanced automatically upon addition/deletion of nodes
● In this example there are 3 nodes, 4 shards and a replica number of 2

MongoDB Cluster Architecture
● Sharding (splitting of data) is done at the replica set level, hence it involves more than one cluster (a shard is on top of a replica set)
● Only the primary node in a replica set answers write requests, but read requests can – depending on the specifics of the configuration- be answered by every node (including secondary nodes) in the set
● Updates flow only from the primary to the secondary
● If a primary node fails, or discovers it is connected to a minority of nodes, a secondary of the same replica set is elected as the primary
● Arbiters (MongoDB instances without data) can assist in breaking a tie in elections.
● Data are balanced across replica sets
● Since a quorum has to be reached, it is better
to have an odd number of voting members

MongoDB vs CouchDB Clusters
● MongoDB clusters are considerably more complex than CouchDB ones
● MongoDB clusters are less available, as – by default – only primary nodes can
talk to clients for read operations, (and exclusively so for write operations)
● MongoDB software routers (MongoS) must be embedded in application
servers, while any HTTP client can connect to CouchDB
● Losing two nodes out of three in the CouchDB architecture shown, means losing access to between one/quarter and half the data, depending on the
nodes that fail
● Depending on the cluster configuration parameters and the nature (primary or
secondary) of the lost nodes, losing two nodes in the MongoDB example may imply losing write access to half the data (although there are ten nodes in the cluster instead of three), and possibly read access too,
These differences are rooted in different approaches to an unsolvable problem, a problem defined by Brewer’s CAP Theorem

Consistency, Availability, Partition-Tolerance
● Consistency: every client receiving an answer receives the same answer from all nodes in the cluster
● Availability: every client receives an answer from any node in the cluster
● Partition-tolerance: the cluster keeps on operating when one or more nodes cannot communicate with the rest of the cluster

Brewer’s CAP Theorem
Consistency, Availability and Partition-Tolerance: pick any two…
Consistency
Availability
Partition- Tolerance
Note: the intersection of the three sets is empty

Brewer’s CAP Theorem
● While the theorem shows all three qualities are symmetrical, Consistency and Availability are at odds only when a Partition happens
● “Hard” network partitions may be rare, but “soft” ones are not (a slow node may be considered dead even if it is not); ultimately, every partition is detected by a timeout
● Can have consequences that impact the cluster as a whole, e.g. a distributed join is only complete when all sub-queries return
● Traditional DBMS architectures were not concerned with network partitions, since all data were supposed to be in a small, co-located cluster of servers
● The emphasis on numerous commodity servers, can result in an increased
number of hardware failures
● The CAP theorem forces us to consider trade-offs among different options

CAP Theorem and the Classification of Distributed Processing Algorithms
Two-phase commit
Consistency
Availability
Partition- Paxos, Tolerance
Multi-Version Concurrency Control

Consistency and Availability: Two phase commit
This is the usual algorithm used in relational DBMS’s (and MongoDB, to same extent), it enforces consistency by:
● locking data that are within the transaction scope
● performing transactions on write-ahead logs
● completing transactions (commit) only when all nodes in the cluster have
performed the transaction
● aborts transactions (rollback) when a partition is detected
This procedure entails the following:
● reduced availability (data lock, stop in case of partition)
● enforced consistency (every database is in a consistent state, and all are left
in the same state)
Therefore, two-phase commit is a good solution when the cluster is co-located, less so when it is distributed

Consistency and Partition-Tolerance: Paxos
● This family of algorithms is driven by consensus, and is both partition-tolerant and consistent
● In Paxos, every node is either a proposer or an accepter :
○ a proposer proposes a value (with a timestamp)
○ an accepter can accept or refuse it (e.g. if the accepter receives a more
recent value)
● When a proposer has received a sufficient number of acceptances (a quorum
is reached), and a confirmation message is sent to the accepters with the
agreed value
● Paxos clusters can recover from partitions and maintain consistency, but the
smaller part of a partition (the part that is not in the quorum) will not send
responses to clients, hence the availability is reduced
● Raft is a similar, but simpler algorithm solving the same problem

Availability and Partition-tolerance: Multi-Version
Concurrency Control (MVCC)
● MVCC is a method to ensure availability (every node in a cluster always accepts requests), and some sort of recovery from a partition by reconciling the single databases with revisions (data are not replaced, they are just given a new revision number)
● In MVCC, concurrent updates are possible without distributed locks (in optimistic locking only the local copy of the object is locked), since the updates have different revision numbers; the transaction that completes last will get a higher revision number, hence will be considered as the current value.
● In case of cluster partition and concurrent requests with the same revision number going to two partitioned nodes, both are accepted, but once the partition is solved, there would be a conflict… a conflict that would have to be solved somehow (CouchDB returns a list of all current conflicts, which are then left to be solved by the application). Think of it a something similar to a software revision control system such as Git.

Extra: The Peculiar Case of the Blockchain
● Blockchains can be described as distributed, inalterable, verifiable, databases. So, how do they map into this classification? (To fix ideas, let’s focus just on the Bitcoin distributed ledger.)
● Bitcoin works on a cluster of peer-to-peer nodes, each containing a copy of the entire database, operated by different -possibly malicious- actors.
● Since new nodes can enter the system at any time, and every node has the entire database, availability is not an issue even in case of a partition, but consistency cannot be assured, since you cannot trust a single node.
● To achieve consistency, Bitcoin uses a form of MVCC based on proof-of-work (a proxy for the computing power used in a transaction) and on repeated confirmations by a majority of nodes of a history of transactions.
● Bitcoin database security is guaranteed by the impossibility of a single actor having enough computing power to alter the history of transactions (with 6 confirmations, an actor that controls 18% of the computing power has just a 1% probability of compromising a legitimate transaction)

Extra: Why Blockchain does not Help Us
● Blockchains are very inefficient… by design:
○ proof-of-work wastes computing power and it is slow
○ every node contains a copy of the entire database
● Consider the cost of a BitCoin transaction (about 2 USD, down from 59 USD a year ago) and the time it takes (which can take anything between 10 minutes and a couple hours)
● By contrast, consider the cost of a bank transaction in the SEPA system (the European Union interbank payment system): 30 to 50 cents, and it is done seconds
● The difference is that SEPA (and the distributed databases described here) assume that no node can be controlled by a malicious actor bent on altering the database
● In conclusion: BlockChain is solution to a very narrow problem (securing transactions on a public network with potentially malicious actors)… which is great, but it is not our kind of problem

MongoDB vs CouchDB Clusters
● While CouchDB uses MVCC, MongoDB uses a mix of two-phase commit (for replicating data from primary to secondary nodes) and Paxos-like (to elect a primary node in a replica-set)
● From the MongoDB 4.4. Documentation: <>
● The different choices of strategies explains the different cluster architectures of these two DBMSs

Why Document-oriented DBMS for Big data?
While Relational DBMSs are extremely good for ensuring consistency and availability, the normalization that lies at the heart of a relational database model implies fine-grained data, which are less conducive to partition-tolerance than coarse-grained data.
● A typical contact database in a relational data model may
include: a person table, a telephone table, an email table and
an address table, all linked to each other.
● The same database in a document-oriented database would
entail one document type only, with telephones numbers, email addresses, etc., nested as arrays in the same document.

● Sharding is the partitioning of a database “horizontally”, i.e. the database rows (or documents) are partitioned into subsets that are stored on different servers. Every subset of rows is called a shard.
● Usually the number of shards is larger than the number of replicas, and the number of nodes is larger than the number of replicas (usually set to 3)
● The main advantage of a sharded database lies in the improvement of performance through the distribution of computing load across nodes. In addition, it makes it easier to move data files around, e.g. when adding new nodes to the cluster
● The number of shards that split a database dictates the (meaningful) number of nodes: the maximum number of nodes is equal to the number of shards (lest a node contains the same shard file twice)
● There are different sharding strategies, most notably:
○ Hash sharding: to distribute rows evenly across the cluster
○ Range sharding: similar rows (say, tweets coming for the same area)
are stored on the same shard

Replication and Sharding
● Replication is the action of storing the same row (or document) on different nodes to make the database fault-tolerant.
● Sharding is the partition of data into different “buckets”
● Replication and sharding can be combined with the objective of maximizing availability while maintaining a minimum level of data
● A bit of nomenclature (CouchDB-specific, but the concepts can be
generalized to other systems):
○ n is the number of replicas (how many times the same data
item is repeated across the cluster)
○ q is the number of shards (how many files a database is split)
○ n * q is the total number of shard files distributed in the different nodes of the cluster

How Shards Look Like in CouchDB
|– 00000000-1fffffff
| `– test.1520993373.couch |– 20000000-3fffffff
| `– test.1520993373.couch |– 60000000-7fffffff
| `– test.1520993373.couch |– 80000000-9fffffff
| `– test.1520993373.couch |– c0000000-dfffffff
| `– test.1520993373.couch `– e0000000-ffffffff
`– test.1520993373.couch
|– 20000000-3fffffff
| `– test.1520993373.couch |– 40000000-5fffffff
| `– test.1520993373.couch |– 80000000-9fffffff
| `– test.1520993373.couch |– a0000000-bfffffff
| `– test.1520993373.couch `– e0000000-ffffffff
`– test.1520993373.couch
● This is the content of the data/shards directory on a node of a three-node
● The test database has q=8, n=2,
16 shards files
● The *.couch files are the actual files
where data are stored
● The sub-directories are named after the
document _ids ranges
|– 00000000-1fffffff
| `– test.1520993373.couch |– 40000000-5fffffff
| `– test.1520993373.couch |– 60000000-7fffffff
| `– test.1520993373.couch |– a0000000-bfffffff
| `– test.1520993373.couch |– c0000000-dfffffff
`– test.1520993373.couch

Partitions in CouchDB (not network
partitions!)
● A partition is a grouping of logically-related rows in the same shard (for instance, all the tweets of the same user)
● Partitioning improves performance by restricting queries to a narrow set of documents within a single shard
● To be effective, partitions have to be relatively small (certainly smaller than a shard)
● A database has to be declared “partitioned” during its creation
● Partitions are a new feature of CouchDB 3.x

MapReduce Algorithms
● This family of algorithms, pioneered by Google, is particularly suited to parallel computing of the Single-Instruction, Multiple-Data type (see Flynn).
● The first step (Map), distributes data across machines, while the second
(Reduce) hierarchically summarizes them until the result is obtained.
● In between reduce steps, shuffling of data across nodes may happen.
● Apart from parallelism, its advantage lies in moving the process to where data
are, greatly reducing network traffic.
● Example (fiord count):
function map(document):
for each word w in document:
emit (w, 1)
function reduce(word, partialCounts):
for each pc in partialCounts:
emit (word, sum)

Sorting Cards with MapReduce-like Algorithm
Map cards by suite

Part 2: Introduction to CouchDB

Why Using CouchDB in This Course?
●Is open-source, hence you can peruse the source code and see how things work
●It has MapReduce queries, hence you can understand how this programming paradigm works
● It is easy to setup a cluster
●It has sharding, replication, and partitions (not to be
mistaken with network partitions)
● The HTTP API makes it easy to interact with it

CouchDB Main Features
The main features of CouchDB 3.x are:
● Document-oriented DBMS, where documents are expressed in JavaScript
Object Notation (JSON)
● HTTP ReST API (more on ReST in later lectures!)
● Web-based admin interface
● Web-ready: since it talks HTTP and produces JSON (it can also produce
HTML or XML), it can be both the data and logic tier of a three-tier application,
hence avoiding the marshaling and unmarshaling of data objects
● Support for MapReduce algorithms, including aggregation at different levels
● JavaScript as the default data manipulation language
● Full-text search
● Support of MongoDB query language
● Support of replication
● Support of partitions
● Support of sharding
● Support of clusterized databases

Fauxton User Interface
Typing http://:5984/_utils into a browser opens the admin user interface, which lets you do most operations easily, including:
● Create/delete databases
● Edit documents
● Edit design documents
● Run views (MapReduce)
● Run Mango queries
● Run full-text searches
● Modify the configuration
● Manage users
● Set up replications

• A CouchDB instance can have many databases; each database can have its own set of functions (grouped into design documents), and can be split in different shards
• Adding and deleting a database is done through a HTTP call:
curl -X PUT “http://localhost:5984/exampledb“ curl -X DELETE “http://localhost:5984/exampledb“
• Listing all databases of an instance is even simpler
curl -X GET “http://localhost:5984/_all_dbs”
…every response’s body is a JSON object:
[“exampledb”, “twitter”, “instagram”]
• In every CouchDB instance there are system databases. These are prefixed by underscore, such as _users

Insertion and retrieval of documents
• To insert a document:
curl -X POST “http://localhost:5984/exampledb” –header “Content- Type:application/json” –data ‘{“type”: “account”, “holder”: “Alice”, “initialbalance”: 1000}’
Response: 201 (202 if fewer than the prescribed number of write operations were successfully performed)
{“ok”:true,”id”:”c43bcff2cdbb577d8ab2933cdc0011f8″,”rev”:”1- b8a039a8143b474b3601b389081a9eec”}
• To retrieve a document:
curl -X GET “http://localhost:5984/exampledb/c43bcff2cdbb577d8ab2933cdc0011f8”
Response: 200 {“_id”:”c43bcff2cd577d8ab2933cdc0011f8″,”_rev”:”1- b8a039a8143b474b3601b389081a9eec”,”type”:”account”,”holder”:”Alice”,”initial balance”:1000}

System Properties of Documents
• _id: is the ID of a single document which can be set during the document load; by default it is generated by CouchDB and guaranteed to be unique
• _rev: revision number of a document. It is guaranteed to be increasing per- document, i.e. every database instance will pick up the same revision as the current version of the document
• Request to name the ID of a new document (note PUT instead of POST):
curl -X PUT “http://localhost:5984/exampledb/charlie” –header

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com