
COMP5338 – Advanced Data Models

Dr. Ying Zhou
School of Information Technologies

COMP5338 – Advanced Data Models
Week 9: Key Value Storage Systems

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Outline
 Overview

 K-V store
Memcached brief intro

 Dynamo
 Overview
 Partitioning Algorithm
 Replication and Consistency

09-2

Key-Value Data Model
 Simplest form of data model

 Each data “record” contains a key and a value
 All queries are key based

 Similar concept in programming language/data structure
 Associative array, hash map, hashtable, dictionary

 Basic API similar to those in hashtable
 put key value
 value = get key

 There are many such systems
 Some just provide the pure K-V form
 The system treats the value as an uninterpretable string/byte array
 Others may add further dimensions in the value part
 The value can be a JSON document, e.g. HyperDex
 The value can contain columns and corresponding values, e.g. Bigtable/HBase
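To make the basic put/get API above concrete, here is a minimal sketch of a pure K-V store backed by a hash map; the class and method names are illustrative only, not any particular system's API.

```python
# Minimal illustrative sketch of the basic K-V API (put / get).
# The value is treated as an uninterpretable byte string, as in pure K-V stores.
class SimpleKVStore:
    def __init__(self):
        self._data = {}          # backing hash map: key -> opaque value

    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value  # overwrite blindly; no schema on the value

    def get(self, key: str) -> bytes:
        return self._data[key]   # all queries are key based; raises KeyError on a miss

store = SimpleKVStore()
store.put("user:42", b'{"name": "Alice"}')
print(store.get("user:42"))
```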

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-3

From Keys to Hashes
 In a key-value store, or key-value data structure, the key can be of any type
 String, Number, Date and Time, Complex object, …
 Keys are usually hashed to get a value of fixed size
 A hash function is any function that can be used to map data of arbitrary size to data of a fixed size
 E.g. any Wikipedia page’s content can be hashed by the SHA-1 algorithm to produce a message digest of 160 bits
 The result is called a hash value, hash code, hash sum or hash

 Hash allows for efficient storage and lookup
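As a small illustration, the sketch below hashes an arbitrarily long string with Python's standard hashlib; the SHA-1 digest is always 160 bits regardless of input size. The sample input is made up.

```python
import hashlib

# Hash data of arbitrary size to a fixed-size (160-bit) SHA-1 digest.
page_content = "Any Wikipedia page's content, however long it is... " * 100
digest = hashlib.sha1(page_content.encode("utf-8")).hexdigest()

print(digest)           # 40 hex characters = 160 bits, regardless of input size
print(len(digest) * 4)  # 160
```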

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-4

Example: keys hashed to fixed-size values that index the hash table entries

key → Hash: Alice → 1, Tom → 3, Bob → 4
Hash entry → Data: 1 → (Alice, 90), 3 → (Tom, 85), 4 → (Bob, 87)

Memcached: motivation
 Memcached is a very early in-memory key-value store
 The main use case is to provide caching for web applications, in particular caching between the database and the application layer
 Motivations:
 Database queries are expensive, so a cache is necessary
 A cache on the DB nodes would make it easy to share query results among various requests
 But the raw memory size is limited
 Caching more on the Web nodes would be a natural extension, but requests by default have their own memory spaces and do not share with each other
 Simple solution
 “Have a global hash table that all Web processes on all machines could access simultaneously, instantly seeing one another’s changes”

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-5

https://www.linuxjournal.com/article/7451

Memcached: features
 As a cache system for query results

 Durability is not part of the consideration
 Data is persisted in the database
 No replication is implemented
 Mechanisms for guaranteeing freshness should be included
 Expiration mechanism
 Cache misses are the norm
 But we want to minimize them
 Distributed cache store
 Utilizing free memory on all machines
 Each machine may run one or many Memcached server instances
 It is beneficial to run multiple Memcached instances on a single machine if the physical memory is larger than the address space supported by the OS

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-6

Memcached: the distributed caching
 The keys of the global hash table are distributed among a number of Memcached server instances
 How do we know the location of a particular key?
 Simple solution
 Each Memcached instance is totally independent
 A given key is always kept on the same server
 The client library knows the list of servers and can use a predetermined partition algorithm to work out the designated server of a key
 The client library runs on the web node
 There are many implementations
 They may have different ways to partition keys
 Simple hash partition or consistent hashing

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-7

Memcached: basic hash partition
 Treat Memcached instances as buckets
 An instance with more memory may represent more buckets
 Use a modulo function to determine the location of each key
 Hash(key) mod #buckets

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-8

Worked example (from the figure):
 The client library runs on the web node; there are two Memcached servers
 One server has 512M of memory for Memcached and is considered as one bucket, with id 0
 The other server has 1G of memory for Memcached and is considered as two buckets, with ids 1 and 2
 The client library knows there are 3 buckets, so it determines the location of a key by taking the key’s hash value mod 3
 Key ‘foo’ has a hash value of 17; 17 mod 3 is 2, so this key is stored on the server managing bucket id 2
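A minimal sketch of how a client library might pick a bucket with this scheme; the bucket-to-server mapping and the choice of MD5 as the hash function are assumptions for illustration, not Memcached's actual implementation.

```python
import hashlib

# Assumed bucket layout from the example above: the 512M server is one bucket (id 0),
# the 1G server is two buckets (ids 1 and 2). Server addresses are made up.
buckets = {0: "server-a:11211", 1: "server-b:11211", 2: "server-b:11212"}

def bucket_for(hash_value: int) -> int:
    return hash_value % len(buckets)      # Hash(key) mod #buckets

def hash_key(key: str) -> int:
    # One possible stable hash; real client libraries may use a different function.
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

# In the lecture example the key 'foo' has hash value 17; 17 mod 3 = 2,
# so the key lives on the server managing bucket 2.
print(bucket_for(17), buckets[bucket_for(17)])
print(bucket_for(hash_key("foo")))        # same idea with a computed hash
```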

Outline
 Overview

 K-V store
Memcached brief intro

 Dynamo
 Overview
 Partitioning Algorithm
 Replication and Consistency

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-9

Dynamo
 Motivation

 Many services in Amazon only need primary-key access to the data store
 E.g. shopping cart
 Both scalability and availability are essential terms in the service level agreement
 Always writable (write never fails)
 Guaranteed latency
 Highly available

 Design consideration
 Simple key value model
 Sacrifice strong consistency for availability
 Conflict resolution is executed during read instead of write
 Incremental scalability

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-10

Service Level Agreements (SLA)
 Applications can deliver their functionality in a bounded time: every dependency in the platform needs to deliver its functionality with even tighter bounds.
 Example: a service guaranteeing that it will provide a response within 300ms for 99.9% of its requests, for a peak client load of 500 requests per second.

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-11

Dynamo Techniques Summary
 Dynamo is a decentralized peer-to-peer system

 All nodes take the same role
 There is no master node

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Problem                            | Technique                                               | Advantage
Partitioning                       | Consistent hashing                                      | Incremental scalability
High availability for writes       | Vector clocks with reconciliation during reads          | Version size is decoupled from update rates
Handling temporary failures        | Sloppy quorum and hinted handoff                        | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees                         | Synchronizes divergent replicas in the background
Membership and failure detection   | Gossip-based membership protocol and failure detection  | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information

09-12

Outline
 Overview

 K-V store
Memcached brief intro

 Dynamo
 Overview
 Partitioning Algorithm
 Replication and Consistency

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-13

Partitioning Algorithm
 Partition or shard a data set

 There is a partition or shard key
 System default vs. user specified key

 There is an algorithm to decide which data goes to which partition
 Range partition vs. Random(Hash) partition

 What happens when data grows
 Bigtable/HBase way: split a partition that grows beyond a threshold

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Worked example (Bigtable/HBase-style split):
 A partition holding keys 0–3 grows beyond the threshold and splits into two partitions, [0, 1] and [2, 3]
 Later the partition holding keys 2–5 grows and splits into [2, 3] and [4, 5]
• The split happens locally and does not have a global impact
• Data may not be evenly distributed in each partition
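A minimal sketch of the split logic under stated assumptions (a threshold of 3 keys and in-memory lists standing in for partitions); it reproduces the worked example above.

```python
# Illustrative Bigtable/HBase-style split: a partition that grows beyond a
# threshold is split locally, without affecting any other partition.
SPLIT_THRESHOLD = 3   # assumed value for illustration

def maybe_split(partition: list) -> list:
    # partition: a sorted list of keys; split at the midpoint key if too large.
    if len(partition) <= SPLIT_THRESHOLD:
        return [partition]
    mid = len(partition) // 2
    return [partition[:mid], partition[mid:]]

print(maybe_split([0, 1, 2, 3]))      # [[0, 1], [2, 3]]
print(maybe_split([2, 3, 4, 5]))      # [[2, 3], [4, 5]]
```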

09-14

Hash Partition- Repartition
 Repartitioning may involve a lot of data movement
 The modulo function of the key’s hash value will redistribute a large number of keys to different partitions

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Worked example:
 With 2 partitions and hv(key) mod 2, hash values 0, 1, 2, 3 map to partition 0 = {0, 2} and partition 1 = {1, 3}
 After repartitioning to 3 partitions with hv(key) mod 3, hash values 0, 1, 2, 3, 4, 5 map to partition 0 = {0, 3}, partition 1 = {1, 4} and partition 2 = {2, 5}
 Most keys end up on a different partition than before

09-15
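The sketch below estimates the cost of modulo repartitioning by counting how many hash values land on a different partition when the bucket count changes from 2 to 3; roughly two thirds of the keys move.

```python
# Count how many hash values are assigned to a different partition
# when repartitioning from mod 2 to mod 3.
hash_values = range(10_000)

moved = sum(1 for hv in hash_values if hv % 2 != hv % 3)
print(f"{moved / len(hash_values):.0%} of keys change partition")  # roughly 2/3
```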

Consistent Hashing
 Consistent hashing
 “a special kind of hashing such that when a hash table is resized, only K/n keys need to be remapped on average, where K is the number of keys, and n is the number of slots.”
[Wikipedia: Consistent Hashing]
 It does not identify each partition as a number in [0, n-1]
 The output range of a hash function is treated as a fixed circular space or “ring” (i.e. the largest hash value wraps around to the smallest hash value)
 Each partition represents a range in the ring space, identified by its position value (token)
 The hash of a data record’s key falls in exactly one range
 In a distributed system, each node represents one partition, or a number of partitions if “virtual nodes” are used

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-16

Consistent Hashing
 Each node in the Dynamo cluster is assigned a “token” representing its position in the “ring”
 Each node is responsible for the region in the ring between it and its predecessor node
 The ring space is the MD5 hash value space (128 bit)
 0 to 2^127 − 1
 The MD5 hash of the key of any data record is used to determine which node is the coordinator of this key. The coordinator is responsible for
 Storing the row data locally
 Replicating the row data on N−1 other nodes, where N is the replication factor
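A minimal sketch of locating a key's coordinator on the ring; the tokens and node names are illustrative, loosely following the 0~99 example on the next slide, and are not Dynamo's actual values.

```python
import bisect

# Small ring of size 100 (positions 0~99); token values and node names are assumed.
tokens = [0, 25, 50, 75]
nodes = {0: "node-a", 25: "node-b", 50: "node-c", 75: "node-d"}

def coordinator(hv: int) -> str:
    # Walk clockwise from the key's hash to the first token;
    # hashes past the largest token wrap around to the smallest one.
    idx = bisect.bisect_left(tokens, hv) % len(tokens)
    return nodes[tokens[idx]]

# The row key "bsanderson" hashes to 31 in the lecture example;
# 31 falls in the range (25, 50], so the node with token 50 is its coordinator.
print(coordinator(31))   # node-c (the node holding token 50)
print(coordinator(80))   # wraps around to the node holding token 0
```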

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-17

Consistent hashing example

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

 Ring space: 0~99 (the ring wraps around, e.g. 76, 77, …, 99, 0)
 Row key “bsanderson” is hashed to 31
 The node with token 50 is responsible for this key; it is called the coordinator of this key

09-18

Virtual Nodes
 Possible issues of the basic consistent hashing algorithm
 Positions are randomly assigned, so a balanced distribution of data across nodes cannot be guaranteed
 Assuming nodes have similar capacity, each handles one partition
 It does not guarantee that similar row keys will be stored/managed by the same node
 Virtual nodes and multiple partitions per node

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Worked example: four nodes hold the token groups (0, 40), (20), (30, 50, 80) and (10, 70)
 The tokens 0, 10, 20, 30, 40, 50, 70, 80 split the ring into ranges (80,0], (0,10], (10,20], (20,30], (30,40], (40,50], (50,70] and (70,80]
 A key hashed to 31 falls in the range (30,40], so it is handled by the node holding token 40, i.e. the node with tokens (0, 40)

09-19
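A minimal sketch of the virtual-node lookup in the worked example above; the physical node labels A-D are made up, since the figure only shows the token groups.

```python
import bisect

# Virtual nodes: each physical node holds several tokens (token groups as above,
# node labels A-D assumed for illustration).
node_tokens = {"A": [0, 40], "B": [20], "C": [30, 50, 80], "D": [10, 70]}

# Flatten to a sorted ring of (token, owner) pairs.
ring = sorted((t, n) for n, toks in node_tokens.items() for t in toks)
tokens = [t for t, _ in ring]

def owner(hv: int) -> str:
    # A key belongs to the first token clockwise from its hash (ranges are (prev, token]).
    idx = bisect.bisect_left(tokens, hv) % len(ring)
    return ring[idx][1]

print(owner(31))   # falls in range (30, 40], so the node holding token 40 ("A") owns it
```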

Memcached has a similar design consideration: an instance with a lot of available memory can be allocated more than one bucket.

Revisit: Memcached distributed hashing

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-20

The server with 1G of memory is allocated two buckets.

Replication

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

 Replication is essential for high availability and durability
 Replication factor (N)
 Coordinator
 Preference list
 Each key (and its data) is stored on the coordinator node as well as on its N−1 clockwise successor nodes in the ring
 The list of nodes that is responsible for storing a particular key is called the preference list
 The preference list contains more than N nodes to allow for node failures
 Some nodes are used as temporary storage
 It can be computed on the fly by any node in the system

Worked example: a key hashed to 31 falls in the range owned by node c, so c is its coordinator; with N = 3 the key is also replicated on the clockwise successors d and a, and b is included as an extra node. Preference list for this key: {c, d, a, b}.
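A minimal sketch of computing a preference list: the coordinator plus its clockwise successors, with one extra node appended to allow for failures. The token values are assumptions chosen so that the result matches the example above.

```python
import bisect

# Ring of (token, node) pairs; token values are assumed for illustration.
ring = sorted([(35, "c"), (50, "d"), (60, "a"), (85, "b")])
tokens = [t for t, _ in ring]

def preference_list(hv: int, n: int, extra: int = 1) -> list:
    # Coordinator = first node clockwise from the key's hash, followed by its
    # clockwise successors; the list is longer than N to tolerate node failures.
    start = bisect.bisect_left(tokens, hv) % len(ring)
    size = min(n + extra, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(size)]

print(preference_list(31, n=3))   # ['c', 'd', 'a', 'b'], as in the example above
```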

09-21

Membership and Failure Detection
 Each node in the cluster is aware of the token ranges handled by its peers
 This is done using a gossip protocol
 A new node joining the cluster will randomly pick a token
 The information is gossiped around the cluster
 Failure detection is also achieved through the gossip protocol
 Local knowledge
 Nodes do not have to agree on whether or not a node is “really dead”
 Used to handle temporary node failures, to avoid communication cost during read/write
 Permanent node departure is handled externally

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-22

Adding Storage Node

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Worked example (replication factor N = 3): a new node with token 10 joins the ring, splitting the old range (0,25] into (0,10] and (10,25]
 The new node receives data in range (0,10] and serves as its coordinator
 It also receives data in ranges (75,0] and (50,75] and serves as a replica for them
 One existing node is no longer the coordinator node for keys in (0,10]
 Other existing nodes are no longer replicas for data in ranges (50,75] and (0,10]
 Only a small amount of data changes its coordinator node

http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

09-23

Outline
 Overview

 Partitioning algorithm

 Replication and Consistency

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-24

Read and Write with Replication
 When there are replicas, there are many options for read/write
 In a typical master/slave replication environment
 A write happens on the master and may propagate to the replicas immediately, waiting for all to ack before declaring success, or lazily, declaring success after the master finishes the write
 A read may happen on the master (strong consistency) or at one replica (may get stale data)
 In an environment where there is no designated master/coordinator, other mechanisms need to be used to ensure a certain level of consistency
 Order of concurrent writes
 How many replicas to contact before answering/acknowledging

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-25

Concurrent Write
 Different clients may try to update an item simultaneously
 If nothing special is done, the system could end up with “split-brain”

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Example: one client writes 5 while another writes 8; the two writes reach Replica 1 and Replica 2 in different orders, so Replica 1 ends with final value 5 and Replica 2 with final value 8. A timestamp is a typical way to order concurrent writes.

Slide based on material by Alan Fekete

09-26

Ordering concurrent writes
 Associate timestamps with the writes, and let the higher timestamp win
 A node ignores a write whose timestamp is lower than that of the value already there (Thomas Write Rule)

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

Example: Write 5 (ts=2) and Write 8 (ts=1) arrive at Replica 1 and Replica 2 in different orders; each replica keeps the value with the higher timestamp, so both end with final value 5. The write with ts=1 is simply ignored where it arrives “late”.

Slide based on material by Alan Fekete
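A minimal sketch of the Thomas Write Rule on a single replica: a write is applied only if its timestamp is higher than the one already stored, so replicas converge regardless of arrival order.

```python
class Replica:
    """Keeps the value with the highest write timestamp (Thomas Write Rule)."""

    def __init__(self):
        self.value, self.ts = None, -1

    def write(self, value, ts):
        if ts > self.ts:               # newer write wins
            self.value, self.ts = value, ts
        # else: ignore the write, it arrived "late"

r1, r2 = Replica(), Replica()
r1.write(8, ts=1); r1.write(5, ts=2)   # writes arrive in timestamp order
r2.write(5, ts=2); r2.write(8, ts=1)   # out of order: the ts=1 write is ignored
print(r1.value, r2.value)              # 5 5 -- both replicas converge
```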

09-27

Quorums

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)

 Suppose each write is initially done to W replicas (out of N)
 Other replicas will be updated later, after the write has been acked to the client
 How can we find the current value of the item when reading?
 The traditional “quorum” approach is to look at R replicas
 Consider the timestamp of the value in each of these
 Choose the value with the highest timestamp as the result of the read
 If W > N/2 and R + W > N, this works properly
 Any write and any read will have at least one site in common (quorums intersect)
 Any read will see the most recent completed write
 There will be at least one replica that is BOTH among the W written and among the R read
Slide based on material by Alan Fekete
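A minimal sketch of a quorum read under these assumptions: contact R replicas and return the value with the highest timestamp; because R + W > N, the read set overlaps the write set of the latest completed write.

```python
# Each replica stores (value, timestamp); a quorum read picks the newest of R responses.
def quorum_read(replicas, r):
    responses = replicas[:r]                      # pretend the first R replicas respond
    return max(responses, key=lambda vt: vt[1])   # choose the value with the highest timestamp

# N = 3, W = 2: the latest write (ts=7) reached two replicas; one replica is still stale.
replicas = [("old", 3), ("new", 7), ("new", 7)]
print(quorum_read(replicas, r=2))   # ('new', 7): with R + W > N the read overlaps the write set
```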

09-28

Dynamo Mechanisms
 Vector Clock
 Aid for resolving version conflicts
 Sloppy quorum + hinted handoff
 Achieve the “always writable” feature
 Eventual consistency

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-29

Dynamo Read/Write Route
 Any node is eligible to receive a client read/write request
 get(key) or put(key, value)
 The node that receives the request can direct it to a node that has the data and is available
 Any node knows the tokens of the other nodes
 Any node can compute the hash value of the key in the request, and hence the preference list
 A node has local knowledge of node failures
 In Dynamo, “A node handling a read or write operation is known as the coordinator. Typically, this is the first among the top N nodes in the preference list.”, which is usually the coordinator of that key unless that node is not available
 Every node can be the coordinator of some operation
 For a given key, the read/write is usually handled by its coordinator or one of the other top N nodes in the preference list

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-30

Data Versioning
 A put() call may return to its caller before the update has been applied at all the replicas
 W < N
 A get() call may return many versions of the same object/value
 R > 1 and R < N
 Typically when N = 3, we may have W = 2 and R = 2
 Challenge: an object may have distinct version sub-histories, which the system will need to reconcile in the future
 Solution: use vector clocks in order to capture causality between different versions of the same object

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-31

Vector Clock
 “A vector clock is effectively a list of (node, counter) pairs. One vector clock is associated with every version of every object.”
 E.g. [(n0,1),(n1,1)], [(n1,2),(n2,0),(n3,2)]
 “One can determine whether two versions of an object are on parallel branches or have a causal ordering, by examining their vector clocks. If the counters on the first object’s clock are less-than-or-equal to all of the nodes in the second clock, then the first is an ancestor of the second and can be forgotten. Otherwise, the two changes are considered to be in conflict and require reconciliation.”
 [(n0,1),(n1,1)] < [(n0,2),(n1,1),(n3,1)]
 [(n0,1),(n1,1)] ?? [(n0,2),(n2,1)]

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-32

Vector Clock Example
 A node with D1, when receiving D2 in a write request, can determine that D2 is the latest and D1 should be garbage collected.
 A node with D1 or D2, when receiving D3 in a write request, can determine that D3 is the latest and D1 or D2 should be garbage collected.
 A node with D3, when receiving D4 in a write request, cannot determine which one should be kept and will save both. The conflict needs to be resolved by the client, with a new version and vector clock written.

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-33

Vector Clock More Examples
 In a system with 6 nodes n0~n5, for a key with preference list {n1~n4}, which of the following vector clock pairs has (have) a causal order:
 [(n1, 1), (n2,2)] and [(n2,1)]
 [(n1, 3), (n3,1)] and [(n1,2),(n2,1)]
 [(n1,2),(n3,1)] and [(n1,4),(n2,1),(n3,1)]

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-34
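A minimal sketch of the vector clock comparison described above: version a is an ancestor of version b if every counter in a is less than or equal to the corresponding counter in b; otherwise the two versions are in conflict. Representing a vector clock as a Python dict is an assumption for illustration.

```python
def descends(a: dict, b: dict) -> bool:
    # True if version b descends from version a: every counter in a is <= the one in b.
    return all(b.get(node, 0) >= counter for node, counter in a.items())

def compare(a: dict, b: dict) -> str:
    if descends(a, b):
        return "a is an ancestor of b (a can be forgotten)"
    if descends(b, a):
        return "b is an ancestor of a (b can be forgotten)"
    return "conflict: parallel branches, reconciliation needed"

print(compare({"n0": 1, "n1": 1}, {"n0": 2, "n1": 1, "n3": 1}))  # causal order
print(compare({"n0": 1, "n1": 1}, {"n0": 2, "n2": 1}))           # conflict (n1 missing in b)
```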
Sloppy Quorum
 Quorum members may include nodes that do not store a replica
 The preference list is larger than N
 Reads and writes may have quorum members that do not overlap
 Both read and write will contact the first N healthy nodes from the preference list and wait for R or W responses
 A write operation will use the hinted handoff mechanism if a node contacted does not store a replica of the data

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-35

Hinted Handoff
 Assume N = 3. When C is temporarily down or unreachable during a write, send the replica to E.
 E is hinted that the replica belongs to C, and it should deliver it to C when C has recovered.
 Sloppy quorum does not guarantee that a read always returns the latest value
 Write set: (B,D,E); Read set: (B,C,D)

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-36

References
 Brad Fitzpatrick, “Distributed Caching with Memcached,” Linux Journal [Online]. August 1, 2004. Available: https://www.linuxjournal.com/article/7451.
 Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. 2007. Dynamo: Amazon’s highly available key-value store. In Proceedings of the Twenty-first ACM SIGOPS Symposium on Operating Systems Principles (SOSP ’07), 205-220.

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 09-37