• Built at/by Facebook
– In support of Facebook Inbox Search
– Billions of writes per day
– Concurrent with reads from 250M users
– Till 2010; then HBase was adopted
– Designed by ex-Dynamo (Amazon) creators
• Released as open source in 2008
• Apache Incubator project in 2009
• Top Level Apache Project in 2010
• Datastax founded in 2010
• Nowadays: deemed one of the fastest key-value stores out there
Design Desiderata
Decentralized
No single point of failure
– In contrast to GoogleFS, BigTable, MR, Spark, …
– Note: distributed != decentralized (what does that mean?)
Data Model: table-like
– NOT RDBMS
– Simple: user control over data format and layout
Highly scalable
– Multi-datacentre operation
Highly available – Replication
High write throughput
But without sacrificing read performance
Replication also used to reduce read latency e.g., choose closest replica
• BigTable/HBase
– Sparse map data model
– GFS, Chubby, et al.
• Dynamo (from Amazon)
• Distributed hash table (DHT)
– Cluster organization (of data centre nodes)
– Partitioning datasets across cluster nodes
• BASE (aka eventual consistency)
– Client-tunable consistency/availability
High Write Throughput
• How is this accomplished?
• Standard trick:
– Use an in-memory buffer for writes → avoid disk I/Os
– Use a write-ahead commit log (WAL) first, before writing to the in-memory buffer (sketched after this list)
Persistence + durability
– Batch WAL updates + append-only log writes
Batch writes amortise disk writes to the WAL
• Why are append-only writes a good idea?
• Think: commodity servers, rotational disks
– Disk read/write time = seek delay + rotational delay + data transfer time + OS-side dereferencing costs (translating a file offset to a block address, etc.)
• Possible loss of persistence?
• Think: write throughput vs write latency
• Think: read path complexity
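To make the WAL-plus-memtable write path concrete, here is a minimal Python sketch (not Cassandra's actual code); the file names, JSON encoding and flush threshold are illustrative assumptions, and a real system would batch/group-commit the fsync calls rather than issuing one per write.

```python
import json
import os

class WriteAheadStore:
    """Minimal sketch of the WAL + in-memory buffer (memtable) write path."""

    def __init__(self, wal_path="commitlog.wal", flush_threshold=1000):
        # Append-only commit log: sequential writes, no seeks.
        self.wal = open(wal_path, "a")
        self.memtable = {}                 # in-memory buffer of recent writes
        self.flush_threshold = flush_threshold
        self.sstable_count = 0

    def put(self, key, value):
        # 1) Durability first: append the mutation to the WAL and force it to disk.
        #    (Here one fsync per write; real systems amortise this via batching.)
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2) Then apply it to the memtable; no random disk I/O on the write path.
        self.memtable[key] = value
        # 3) When the memtable grows large, flush it sequentially to an SSTable-like file.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        path = f"sstable-{self.sstable_count}.json"
        with open(path, "w") as f:
            # Sorted dump keeps later reads/merges simple (as in LSM trees).
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_count += 1
        self.memtable.clear()
```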
Data Model
• Cells/columns and a CRUD API
• From BigTable/HBase
– Distributed multi-dimensional map
– Indexed by a key
– Key-value, where values are structured
• i.e., a bunch of columns
• Data unit: Column
• Column Families ~= Tables in RDBMSs
• Super CFs:
– CFs can contain CFs
– E.g., in the inbox search app:
• A super column for each of the words in each msg
• A column family per word, with each column being the message ids that contain the word
• Term search queries use the above super columns (see the sketch below)
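As a rough illustration of the super-column idea (the row key, terms and message ids below are invented, not the real inbox-search schema), the nested structure can be pictured as dictionaries: row key → super column (term) → column (message id) → value.

```python
# Hypothetical sketch of the nested data model for inbox search:
# row key (user id) -> super column (term) -> column (message id) -> value.
inbox_search_rows = {
    "user:alice": {                      # row key
        "hello": {                       # super column: a search term
            "msg-101": "",               # column name: a message id containing "hello"
            "msg-204": "",
        },
        "cassandra": {
            "msg-204": "",
        },
    }
}

# A term-search query then amounts to one lookup per term:
def messages_containing(rows, user, term):
    return sorted(rows.get(user, {}).get(term, {}).keys())

print(messages_containing(inbox_search_rows, "user:alice", "hello"))
# ['msg-101', 'msg-204']
```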
Accessing Data
• Operations on a given row are atomic (per replica)
– Regardless of how many columns are accessed!
• Simple API, borrowed from DHTs (illustrated below):
– insert(table, key, rowMutation)
– get(table,key,columnName)
– delete(table,key,columnName)
• Coordinator
– One per access request
– Receives results and sends them to the client
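A hedged sketch of how a client might drive this three-call API through a coordinator; the Coordinator class and its in-memory store are illustrative stand-ins, not the real client library or server code.

```python
class Coordinator:
    """Illustrative stand-in for the node coordinating a request."""

    def __init__(self):
        self.store = {}   # table -> key -> {column: value}

    def insert(self, table, key, row_mutation):
        # row_mutation is a dict of column -> value; applied atomically per row.
        row = self.store.setdefault(table, {}).setdefault(key, {})
        row.update(row_mutation)

    def get(self, table, key, column_name):
        return self.store.get(table, {}).get(key, {}).get(column_name)

    def delete(self, table, key, column_name):
        self.store.get(table, {}).get(key, {}).pop(column_name, None)


coord = Coordinator()
coord.insert("Users", "alice", {"email": "alice@example.org", "city": "Glasgow"})
print(coord.get("Users", "alice", "email"))   # alice@example.org
coord.delete("Users", "alice", "city")
```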
Accessing Data
• Based on Consistent Hashing
• Using a P2P design philosophy for the cluster's network → decentralized, no single point of failure
No master
No global knowledge on:
• Which nodes are in the cluster
• Which data items are stored in the cluster
• What is the load of each node
• Which nodes have failed or are non-responsive
Minimal cost for when nodes join/leave
• No global rehashing/re-placing of items
Detour: DHT Overlay Networks
• Wide-area (peer-to-peer) data systems
– Also usable/deployable within a single datacentre
• Hashtable-like API
– Status status = Put(Key k, Value v, Boolean overwrite = true)
– Value v = Get(Key k)
Detour: DHT Overlay Networks
• Interface:
– Data item Insert(key) / Lookup(key)
– Data item Delete(key) / Update(key)
– Node Join() / Leave()
• m-bit IDs for both nodes and data items
– IDs are arranged on the ID-circle (namespace)
• i.e., node/item IDs ∈ [0, 2^m)
– Using pseudo-uniform random hash function for item
IDs and node IDs
• IDs are uniformly distributed
Detour: Consistent Hashing
• Item placement uses a successor function
• When a node n joins the network, it reclaims all items it would have stored if n was in the network when the items were added
– It contacts node succ(n) and retrieves all items k with ID(k) ≤ ID(n) (i.e., the items that now fall on n's arc)
• When n leaves the network, it sends all of its items to node succ(n)
– succ(): {0, 1, …, 2^m − 1} → {0, 1, …, 2^m − 1}
– Key k with ID ID(k) is assigned to node n with ID ID(n) s.t. ID(n) ≥ ID(k) and ∄ node n′: ID(n) > ID(n′) ≥ ID(k)
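A minimal sketch of successor-based placement, assuming a small m and SHA-1 as the pseudo-uniform hash; the node addresses and ring size are made up for the example.

```python
import hashlib
from bisect import bisect_left

M = 16                       # m-bit ID space (assumption for the example)
RING_SIZE = 2 ** M

def ring_id(name):
    # Pseudo-uniform hash of node addresses / item keys onto [0, 2^m).
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING_SIZE

def successor(item_id, node_ids):
    """succ(): first node ID clockwise from item_id (wrapping around the ring)."""
    node_ids = sorted(node_ids)
    idx = bisect_left(node_ids, item_id)
    return node_ids[idx % len(node_ids)]    # wrap to the smallest ID if needed

nodes = {addr: ring_id(addr) for addr in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
key = "this is the life"
owner_id = successor(ring_id(key), list(nodes.values()))
print(f"key ID {ring_id(key)} is stored on the node with ID {owner_id}")
```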
[Figure: consistent-hashing example with ID space 2^m = 64. Node IDs come from hashing IP addresses (e.g., SHA-1(150.148.140.23) = 0; other nodes: 198.132.112.1, 150.148.140.2, 150.148.140.33, 198.132.112.22, 142.112.140.12); item IDs come from hashing keys ("no cars go", "precious", "this is the life", "Radiohead"); e.g., SHA-1(precious) = 20, so "precious" is stored at succ(20).]
Detour: DHT Overlay Networks
• Routing state: each node knows its successor in ID circle
– Not enough: Lookups would take O(N) hops
• Each node knows (the IP addresses of) m (= log N) other nodes, kept in a finger table:
– The i-th row of the finger table of node n points to the node responsible for ID (ID(n) + 2^(i-1)) mod 2^m
– The first row points to the successor of n
• Finger entries can point forwards and/or backwards (predecessors)
– i.e., the i-th row points to the predecessor of (ID(n) - 2^(i-1)) mod 2^m
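A sketch of forward-pointing finger-table construction in the Chord style described above; note that it cheats by using a global sorted list of node IDs, which exists here only to build the example (a real DHT node would not have that knowledge).

```python
from bisect import bisect_left

M = 16                                    # m-bit ID space (assumption)

def successor_of(ident, sorted_node_ids):
    # Node responsible for an ID: first node ID clockwise from it.
    idx = bisect_left(sorted_node_ids, ident % (2 ** M))
    return sorted_node_ids[idx % len(sorted_node_ids)]

def finger_table(n_id, sorted_node_ids, m=M):
    # Row i (1-based) points to the node responsible for (n + 2^(i-1)) mod 2^m;
    # row 1 is therefore just the successor of n.
    return [successor_of((n_id + 2 ** (i - 1)) % (2 ** m), sorted_node_ids)
            for i in range(1, m + 1)]

node_ids = sorted([1, 9, 20, 35, 48, 60000])
print(finger_table(9, node_ids))
```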
Detour: DHT Overlay Networks
[Figure: lookup example on the ring: find the node storing the key "This Is the Life"; keyID = hash("this is the life") = 20, so the lookup is routed towards succ(20).]
Detour: DHT Overlay Networks
• With high probability, the number of hops taken (nodes visited) during lookup() is O(log2 N) in an N-node network.
• Intuition:
– With m-bit IDs, the maximum distance that will be travelled is 2^m
– At each step the search distance is halved
– ⇒ m = O(log2 N) hops
• This is achieved without global knowledge: each node
has log2 N state (its finger table) !
At large N we expect each node to be responsible for roughly equal-length arcs → load balancing
Actually, there are two placement methods:
• One based on the RandomPartitioner
– This uses a hash function with a random output in a given range of values (e.g., a random ring position)
– A.k.a. brain-dead load balancing
– In practice, not very good
• One based on the OrderPreservingPartitioner
– This uses an order-preserving hash function (see the sketch below)
– Function h is order-preserving if x <= y ⇒ h(x) <= h(y)
– Great for range queries
– Can you see the problem with this data placement strategy?
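To contrast the two partitioners, here is an illustrative sketch; the token width and byte-padding rule are assumptions, not Cassandra's exact token format. The hash-based token scatters keys uniformly over the ring, while the order-preserving token keeps key order, which is what makes range scans easy but also lets popular key ranges pile up on a few nodes.

```python
import hashlib

TOKEN_BITS = 128
RING = 2 ** TOKEN_BITS

def random_partitioner_token(key):
    # Hashing destroys key order but spreads tokens ~uniformly over the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING

def order_preserving_token(key):
    # Treat the key bytes (padded to 16 bytes) as a big integer, so that
    # key1 <= key2 implies token(key1) <= token(key2); keys longer than
    # 16 bytes would need more care -- this is only an illustration.
    return int.from_bytes(key.encode()[:16].ljust(16, b"\x00"), "big")

keys = ["apple", "apricot", "banana"]
op = [order_preserving_token(k) for k in keys]
print(op == sorted(op))                         # True: key order is preserved
print([random_partitioner_token(k) for k in keys])   # tokens scattered over the ring
```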
• Given an item key, how do I find the node responsible for it?
• Different than classic DHTs (i.e., NO finger tables)
– Classic DHTs were designed for highly dynamic P2P networks
• Be able to route from any node to any node without central global knowledge
• Lower cost of maintaining a consistent routing state
Limit the per-node routing state to O(log N) entries in N-node networks
– Cassandra designed for operation in datacentres
• Accrue global knowledge about the existence and location of all nodes
• BUT do so in a decentralized way!
– No master node storing such knowledge
• Uses a gossiping protocol
– Knowledge floods/propagates through the network
– Eventually all nodes become aware of it
– A.k.a. epidemic propagation / epidemic protocols
• Each node picks a position on the ring (its token)
• Tokens are gossiped around the cluster
– Every second, each node exchanges messages with 3 other nodes of the same cluster
• These messages contain info about themselves (e.g., ID, ring position etc.) and about other nodes they have already gossiped with
Eventually, each node knows of
• All other nodes in the system
• AND the arc of the ring they are responsible for
• When a node joins the cluster:
– It reads a config file containing a few seed nodes of the cluster
– These seeds are then used for initial gossiping (see the sketch below)
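A toy sketch of the gossip/epidemic idea under the parameters above (a few peers contacted per round); the message format, per-node versioning and class structure are invented for illustration.

```python
import random

class GossipNode:
    """Sketch of epidemic propagation of (node id, token) state."""

    def __init__(self, node_id, token, seeds=()):
        self.node_id = node_id
        # This node's view of the cluster: node_id -> (token, version).
        self.view = {node_id: (token, 1)}
        self.peers = list(seeds)          # addresses learned from the config file

    def gossip_round(self, cluster):
        # Each round, exchange state with a few peers (3 in Cassandra's case).
        targets = random.sample(self.peers, k=min(3, len(self.peers)))
        for t in targets:
            peer = cluster[t]
            self._merge(peer.view)        # learn what the peer knows
            peer._merge(self.view)        # and tell the peer what we know
            # Learning about new nodes also gives both sides new gossip partners.
            self.peers = [n for n in self.view if n != self.node_id]
            peer.peers = [n for n in peer.view if n != peer.node_id]

    def _merge(self, other_view):
        for node, (token, version) in other_view.items():
            if node not in self.view or self.view[node][1] < version:
                self.view[node] = (token, version)


cluster = {i: GossipNode(i, token=i * 1000, seeds=[0]) for i in range(5)}
for _ in range(3):
    for node in cluster.values():
        node.gossip_round(cluster)
print(len(cluster[4].view))   # 5: node 4 now knows every node and its token
```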
Voluntary departures: Node leaves
This IS the point of consistent hashing !
What about involuntary leaves (failures)? To tolerate failures, we need replication !
Replication in Cassandra: factor of replication N = 3, default replication scheme
Note: Nodes have different capacities!
Note: Nodes possibly belong in different DCs and different racks!
Node failure, replication factor = 3
Note: Replicas help in ensuring High Availability AND in Load Balancing! Neighbours detect failures and recoveries and try to ensure the desired replication factor.
Replication in Cassandra
• # replicas -- configurable
• Various placement schemes:
– SimpleSnitch
• Default
• Rack unaware
• N-1 successive nodes
– RackInferringSnitch
• Infers DC/rack from IP
• 1st replica goes to another rack/datacentre
• 2nd to same datacentre but different Rack
• Can be extended to use files that describe node topology, etc.
– PropertyFileSnitch
• Configured w/ a properties file
• Follows same logic as above
– EC2Snitch
• For use when deployed over Amazon EC2
• Coordinator oversees replication
• Each node knows location of all replicas for all objects
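A sketch of the rack-unaware, SimpleSnitch-style placement described above: the replica set is the node owning the token range plus the next N-1 distinct nodes walking clockwise around the ring. Function and variable names are illustrative.

```python
from bisect import bisect_left

def replica_nodes(key_token, sorted_tokens, token_to_node, n=3):
    """Rack-unaware placement sketch: owner of the token range plus the
    next n-1 distinct nodes clockwise on the ring."""
    distinct_nodes = len(set(token_to_node.values()))
    replicas, i = [], bisect_left(sorted_tokens, key_token)
    while len(replicas) < min(n, distinct_nodes):
        node = token_to_node[sorted_tokens[i % len(sorted_tokens)]]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas

tokens = {10: "A", 30: "B", 55: "C", 80: "D"}
print(replica_nodes(42, sorted(tokens), tokens, n=3))   # ['C', 'D', 'A']
```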
Load Balancing? Virtual Nodes
Note: Blue and yellow nodes are more powerful than the others
So, virtual nodes help with LB in 2 ways:
1. Reducing arc-size inequalities
2. Addressing the node heterogeneity issues
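A sketch of how virtual nodes can be assigned, assuming the number of vnodes per physical node is chosen in proportion to its capacity (the node names and capacities below are invented).

```python
import hashlib
from collections import Counter

RING = 2 ** 32

def vnode_tokens(node_name, num_vnodes):
    # Each virtual node gets its own token, i.e. its own (small) arc on the ring.
    return [int(hashlib.sha1(f"{node_name}#{i}".encode()).hexdigest(), 16) % RING
            for i in range(num_vnodes)]

# Heterogeneous cluster: the powerful nodes take twice as many vnodes (assumed).
ring = {}
for node, capacity in [("blue", 8), ("yellow", 8), ("green", 4), ("red", 4)]:
    for token in vnode_tokens(node, capacity):
        ring[token] = node

# Each physical node now owns many small arcs scattered around the ring, which
# evens out arc-length variance and matches load to capacity.
print(Counter(ring.values()))   # blue/yellow own ~2x as many token ranges as green/red
```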
Data Consistency
• The consistency issue in Cassandra:
– Is the data read the most recent one?
• Recall: there are several replicas
– How can I know if a replica is up-to-date or stale?
– Can I read from any replica?
– Do I need to read from several?
– Is this issue related to how many replicas are updated during writes?
• Such issues are dealt with using a replication control protocol
• Note: For some applications, reading stale data is acceptable!
– Can you think of any such application?
A replication protocol with primary/secondary replicas
[Figure: primary/secondary replication. The client sends Write(X) to the primary, which propagates Write(X) to the secondaries; Ack Write(X) and Commit Write(X) messages complete the protocol.]
What happens if primary goes down?
A replica can serve requests BUT there is NO guarantee this replica is current!
This is the eventual consistency model (BASE)
A replication protocol with majority consensus
Send Write(X) to at least a majority
All replicas are equal
(i.e., no primary/secondary)
Each replica applying a Write(X) increments a local version number
Wait to receive ACKs
From at least a majority of replicas
A write(X) is committed ONLY if a majority of replicas participate
This implies a two-phase protocol
Phase 1: ensure that at least a majority are available; if not, abort the write operation
Phase 2: send writes to at least a majority
A read(X) involving a majority of replicas is certain to see the up-to-date state of X → the one with the latest version! (see the sketch below)
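A sketch of the two-phase, majority-consensus write described above; the Replica class is a toy in-memory stand-in so the example runs, and the interface (is_up, version_of, apply) is an assumption for illustration.

```python
class Replica:
    """Toy in-memory replica used only to exercise the sketch below."""
    def __init__(self, up=True):
        self.up, self.data = up, {}            # key -> (version, value)
    def is_up(self):
        return self.up
    def version_of(self, key):
        return self.data.get(key, (0, None))[0]
    def apply(self, key, value, version):
        self.data[key] = (version, value)
        return True

def majority_write(replicas, key, value):
    """Two-phase majority write: abort if a majority is unreachable,
    commit only if a majority acknowledges the write."""
    majority = len(replicas) // 2 + 1
    # Phase 1: check that at least a majority of replicas are reachable.
    live = [r for r in replicas if r.is_up()]
    if len(live) < majority:
        return False
    # Phase 2: send the write, with a bumped version number, to the live replicas.
    new_version = max(r.version_of(key) for r in live) + 1
    acks = sum(1 for r in live if r.apply(key, value, new_version))
    return acks >= majority

replicas = [Replica(), Replica(), Replica(up=False)]
print(majority_write(replicas, "x", 42))   # True: 2 of 3 replicas form a majority
```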
Consistency vs. Performance
• The enforced consistency level has serious performance implications!
• For READs:
– Reading the replica closest to me incurs the smallest delay
• So if I can READ FROM ANY, it'd be great
– Waiting for just one replica node to return the data is much faster than waiting for several!
• So if I can READ FROM ONE, it'd be great
• For WRITEs:
– Writing to one (or fewer) replicas is much faster than writing to many!
• However, having a primary replica has the disadvantages of:
– Risk of performance bottlenecks
– Reduced consistency
Quorum-based replication
• A generalization of majority consensus
• A read quorum, r, is the number of replicas (r <= N) that must be contacted during a READ
• A write quorum, w, is the number of replicas (w <= N) that must be contacted during a WRITE
• The following conditions must hold:
– r + w > N
→ Guaranteed intersection of all operations at a single replica
→ All READs guaranteed to see the latest version!
• Tunable trade-offs
– For read-intensive objects / apps: r can be smaller (→ w is larger)
– For write-intensive objects / apps: w can be smaller (→ r is larger)
– This guarantees consistency without high performance overheads! (see the sketch below)
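A small sketch of the quorum rule and of how a read over r replicas resolves the latest value; representing replica state as (version, value) pairs is an assumption made for illustration.

```python
def check_quorums(n, r, w):
    # r + w > N guarantees every read quorum intersects every write quorum,
    # so at least one contacted replica holds the latest committed version.
    if r + w <= n:
        raise ValueError("r + w must exceed N for read/write quorums to intersect")

def quorum_read(responses):
    # 'responses' are (version, value) pairs from r replicas; the intersection
    # property means the highest version seen is the latest committed one.
    return max(responses, key=lambda pair: pair[0])[1]

check_quorums(n=3, r=1, w=3)                   # read-one / write-all: read-optimised
check_quorums(n=3, r=2, w=2)                   # classic majority quorums
print(quorum_read([(4, "old"), (5, "new")]))   # "new"
```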
Consistency Levels
Cassandra lets apps decide among the following options.
WRITE Consistency Levels
– ANY: written to at least 1 node (including HH)
– ONE: written to at least 1 replica's commit log and memory table
– QUORUM: written to N/2+1 replicas
– LOCAL_QUORUM: written to N/2+1 replicas within the local D.C. (only with cross-D.C. strategy)
– EACH_QUORUM: written to N/2+1 replicas within each D.C. (only with cross-D.C. strategy)
– ALL: written to all replicas
HH is hinted handoff. If a node A cannot apply a write(x) request due to the unavailability of the node responsible for x, then:
• A stores a note (hint) locally describing the write request, and
• A checks for the responsible node's availability and, when it comes back up, relays the write(x) to it (sketched below).
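A sketch of the hinted-handoff idea just described; send_write, node_is_up and the class structure are invented stand-ins, not Cassandra's actual internals.

```python
import time
from collections import defaultdict

def send_write(node, key, value):
    print(f"write({key}={value}) applied on {node}")   # stand-in for a real RPC

class HintedHandoff:
    """Sketch of hinted handoff as performed by node A."""

    def __init__(self, node_is_up):
        self.node_is_up = node_is_up
        self.hints = defaultdict(list)    # target node -> writes it missed

    def write(self, target_node, key, value):
        if self.node_is_up(target_node):
            send_write(target_node, key, value)
            return
        # Target is down: keep a note (hint) describing the write instead of failing it.
        self.hints[target_node].append((key, value, time.time()))

    def on_node_recovered(self, target_node):
        # When the node comes back up, relay the hinted writes to it.
        for key, value, _hinted_at in self.hints.pop(target_node, []):
            send_write(target_node, key, value)


down = {"nodeB"}
hh = HintedHandoff(node_is_up=lambda n: n not in down)
hh.write("nodeB", "x", 1)          # nodeB is down -> a hint is stored
down.discard("nodeB")
hh.on_node_recovered("nodeB")      # hint replayed: write(x=1) applied on nodeB
```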
READ Consistency Levels
– ANY: not supported
– ONE: returns the record returned by the first replica to respond
– QUORUM: returns the record with the most recent timestamp once at least N/2+1 replicas have reported
– LOCAL_QUORUM: returns the record with the most recent timestamp once at least N/2+1 replicas within the local D.C. have reported
– EACH_QUORUM: returns the record with the most recent timestamp once at least N/2+1 replicas within each D.C. have reported
– ALL: returns the record with the most recent timestamp once all replicas have responded
Cassandra vs. HBase
• Similarities:
– Both instances of LSM trees
• Plus per-file index, Bloom filters, compaction rounds, etc.
– Same (pretty much) data model
– Data partitioned horizontally
• Although not at region boundaries as in BigTable/HBase
• Default partitioner: RandomPartitioner (MD5-hash)
• What happens if we need to change the partitioning for an existing table?
– Data replicated across nodes
• Differences:
– Cassandra generally faster overall
– No single point of failure in Cassandra
(no Master: each node is a possible master)
– Easily supports multi-datacentre deployments
– Somewhat easier to use out of the box (OpsCenter, CQL)