• Built at/by Facebook
– In support of Facebook Inbox Search
– Billions of writes per day
– Concurrent with reads from 250M users
– Till 2010; then HBase was adopted
– Designed by ex-Dynamo (Amazon) creators
• Released as open source in 2008
• Apache Incubator project in 2009
• Top Level Apache Project in 2010
• Datastax founded in 2010
• Nowadays: deemed one of the fastest key-value stores out there
Design Desiderata
Decentralized
No single point of failure
– In contrast to GoogleFS, BigTable, MR, Spark, …
– Note: distributed != decentralized (what does that mean?)
Data Model: table-like
– NOT RDBMS
– Simple: user control over data format and layout
Highly scalable
– Multi-datacentre operation
Highly available – Replication
High write throughput
But without sacrificing read performance
Replication also used to reduce read latency e.g., choose closest replica
• BigTable/HBase
– Sparse map data model
– GFS, Chubby, et al.
• Dynamo (from Amazon)
• Distributed hash table (DHT)
– Cluster organization (of data centre nodes)
– Partitioning datasets across cluster nodes
• BASE (aka eventual consistency)
– Client-tunable consistency/availability
High Write Throughput
• How is this accomplished?
• Standard trick:
– Use an in-memory buffer for writes → avoid disk I/Os
– Use a write-ahead commit log (WAL) first, before writing to the in-memory buffer (sketched after this list)
Persistence + durability
– Batch WAL updates + append-only log writes
Batch writes amortise disk writes to the WAL
• Why are append-only writes a good idea?
• Think: commodity servers, rotational disks
– Disk read/write time = seek delay + rotational delay + data transfer time + OS-side dereferencing costs (translating a file offset to a block address, etc.)
• Possible loss of persistence?
• Think: write throughput vs write latency
• Think: read path complexity
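To make the WAL-plus-memtable write path concrete, here is a minimal Python sketch (not Cassandra's actual code); the file names, JSON encoding and flush threshold are illustrative assumptions, and a real system would batch/group-commit the fsync calls rather than issuing one per write.

```python
import json
import os

class WriteAheadStore:
    """Minimal sketch of the WAL + in-memory buffer (memtable) write path."""

    def __init__(self, wal_path="commitlog.wal", flush_threshold=1000):
        # Append-only commit log: sequential writes, no seeks.
        self.wal = open(wal_path, "a")
        self.memtable = {}                 # in-memory buffer of recent writes
        self.flush_threshold = flush_threshold
        self.sstable_count = 0

    def put(self, key, value):
        # 1) Durability first: append the mutation to the WAL and force it to disk.
        #    (Here one fsync per write; real systems amortise this via batching.)
        self.wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self.wal.flush()
        os.fsync(self.wal.fileno())
        # 2) Then apply it to the memtable; no random disk I/O on the write path.
        self.memtable[key] = value
        # 3) When the memtable grows large, flush it sequentially to an SSTable-like file.
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        path = f"sstable-{self.sstable_count}.json"
        with open(path, "w") as f:
            # Sorted dump keeps later reads/merges simple (as in LSM trees).
            json.dump(dict(sorted(self.memtable.items())), f)
        self.sstable_count += 1
        self.memtable.clear()
```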
Data Model
• Cells/columns and a CRUD API
• From BigTable/HBase
– Distributed multi-dimensional map
– Indexed by a key
– Key-value, where values are structured
• i.e., a bunch of columns
• Data unit: Column
• Column Families ~= Tables in RDBMSs
• Super CFs:
– CFs can contain CFs
– E.g., in the inbox search app:
• A super column for each of the words in each msg
• A column family per word, with each column being the message ids that contain the word
• Term search queries use the above super columns (see the sketch below)
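As a rough illustration of the super-column idea (the row key, terms and message ids below are invented, not the real inbox-search schema), the nested structure can be pictured as dictionaries: row key → super column (term) → column (message id) → value.

```python
# Hypothetical sketch of the nested data model for inbox search:
# row key (user id) -> super column (term) -> column (message id) -> value.
inbox_search_rows = {
    "user:alice": {                      # row key
        "hello": {                       # super column: a search term
            "msg-101": "",               # column name: a message id containing "hello"
            "msg-204": "",
        },
        "cassandra": {
            "msg-204": "",
        },
    }
}

# A term-search query then amounts to one lookup per term:
def messages_containing(rows, user, term):
    return sorted(rows.get(user, {}).get(term, {}).keys())

print(messages_containing(inbox_search_rows, "user:alice", "hello"))
# ['msg-101', 'msg-204']
```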
Accessing Data
• Operations on a given row are atomic (per replica)
– Regardless of how many columns are accessed!
• Simple API, borrowed from DHTs (illustrated below):
– insert(table, key, rowMutation)
– get(table,key,columnName)
– delete(table,key,columnName)
• Coordinator
– One per access request
– Receives results and sends them to the client
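A hedged sketch of how a client might drive this three-call API through a coordinator; the Coordinator class and its in-memory store are illustrative stand-ins, not the real client library or server code.

```python
class Coordinator:
    """Illustrative stand-in for the node coordinating a request."""

    def __init__(self):
        self.store = {}   # table -> key -> {column: value}

    def insert(self, table, key, row_mutation):
        # row_mutation is a dict of column -> value; applied atomically per row.
        row = self.store.setdefault(table, {}).setdefault(key, {})
        row.update(row_mutation)

    def get(self, table, key, column_name):
        return self.store.get(table, {}).get(key, {}).get(column_name)

    def delete(self, table, key, column_name):
        self.store.get(table, {}).get(key, {}).pop(column_name, None)


coord = Coordinator()
coord.insert("Users", "alice", {"email": "alice@example.org", "city": "Glasgow"})
print(coord.get("Users", "alice", "email"))   # alice@example.org
coord.delete("Users", "alice", "city")
```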
Accessing Data
• Based on Consistent Hashing
• Using a P2P design philosophy for the cluster's network → decentralized, no single point of failure
No master
No global knowledge on:
• Which nodes are in the cluster
• Which data items are stored in the cluster
• What is the load of each node
• Which nodes have failed or are non-responsive
Minimal cost for when nodes join/leave
• No global rehashing/re-placing of items
Detour: DHT Overlay Networks
• Wide-area (peer-to-peer) data systems
– Also usable/deployable within a single datacentre
• Hashtable-like API
– Status status = Put(Key k, Value v, Boolean overwrite = true)
– Value v = Get(Key k)
Detour: DHT Overlay Networks
• Interface:
– Data item Insert(key) / Lookup(key)
– Data item Delete(key) / Update(key)
– Node Join() / Leave()
• m-bit IDs for both nodes and data items
– IDs are arranged on the ID-circle (namespace)
• i.e., node/item IDs ∈ [0, 2^m)
– Using pseudo-uniform random hash function for item
IDs and node IDs
• IDs are uniformly distributed
Detour: Consistent Hashing
• Item placement uses a successor function
• When a node n joins the network, it reclaims all items it would have stored if n was in the network when the items were added
– It contacts node succ(n) and retrieves all items k with ID(k) ≤ ID(n) (i.e., the items that now fall on n's arc)
• When n leaves the network, it sends all of its items to node succ(n)
– succ(): {0, 1, …, 2^m − 1} → {0, 1, …, 2^m − 1}
– Key k with ID ID(k) is assigned to node n with ID ID(n) s.t. ID(n) ≥ ID(k) and ∄ node n′: ID(n) > ID(n′) ≥ ID(k)
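A minimal sketch of successor-based placement, assuming a small m and SHA-1 as the pseudo-uniform hash; the node addresses and ring size are made up for the example.

```python
import hashlib
from bisect import bisect_left

M = 16                       # m-bit ID space (assumption for the example)
RING_SIZE = 2 ** M

def ring_id(name):
    # Pseudo-uniform hash of node addresses / item keys onto [0, 2^m).
    return int(hashlib.sha1(name.encode()).hexdigest(), 16) % RING_SIZE

def successor(item_id, node_ids):
    """succ(): first node ID clockwise from item_id (wrapping around the ring)."""
    node_ids = sorted(node_ids)
    idx = bisect_left(node_ids, item_id)
    return node_ids[idx % len(node_ids)]    # wrap to the smallest ID if needed

nodes = {addr: ring_id(addr) for addr in ["10.0.0.1", "10.0.0.2", "10.0.0.3"]}
key = "this is the life"
owner_id = successor(ring_id(key), list(nodes.values()))
print(f"key ID {ring_id(key)} is stored on the node with ID {owner_id}")
```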
[Figure: consistent-hashing example with ID space 2^m = 64. Node IDs come from hashing IP addresses (e.g., SHA-1(150.148.140.23) = 0; other nodes: 198.132.112.1, 150.148.140.2, 150.148.140.33, 198.132.112.22, 142.112.140.12); item IDs come from hashing keys ("no cars go", "precious", "this is the life", "Radiohead"); e.g., SHA-1(precious) = 20, so "precious" is stored at succ(20).]
Detour: DHT Overlay Networks
• Routing state: each node knows its successor in ID circle
– Not enough: Lookups would take O(N) hops
• Each node knows (the IP addresses of) m (= log N) other nodes, kept in a finger table:
– The i-th row of the finger table of node n points to the node responsible for ID (ID(n) + 2^(i-1)) mod 2^m
– The first row points to the successor of n
• Finger entries can point forwards and/or backwards (predecessors)
– i.e., the i-th row points to the predecessor of (ID(n) - 2^(i-1)) mod 2^m
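A sketch of forward-pointing finger-table construction in the Chord style described above; note that it cheats by using a global sorted list of node IDs, which exists here only to build the example (a real DHT node would not have that knowledge).

```python
from bisect import bisect_left

M = 16                                    # m-bit ID space (assumption)

def successor_of(ident, sorted_node_ids):
    # Node responsible for an ID: first node ID clockwise from it.
    idx = bisect_left(sorted_node_ids, ident % (2 ** M))
    return sorted_node_ids[idx % len(sorted_node_ids)]

def finger_table(n_id, sorted_node_ids, m=M):
    # Row i (1-based) points to the node responsible for (n + 2^(i-1)) mod 2^m;
    # row 1 is therefore just the successor of n.
    return [successor_of((n_id + 2 ** (i - 1)) % (2 ** m), sorted_node_ids)
            for i in range(1, m + 1)]

node_ids = sorted([1, 9, 20, 35, 48, 60000])
print(finger_table(9, node_ids))
```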
Detour: DHT Overlay Networks
[Figure: lookup example on the ring: find the node storing the key "This Is the Life"; keyID = hash("this is the life") = 20, so the lookup is routed towards succ(20).]
Detour: DHT Overlay Networks
• With high probability, the number of hops taken (nodes visited) during lookup() is O(log2 N) in an N-node network.
• Intuition:
– With m-bit IDs, the maximum distance that will be travelled is 2^m
– At each step the search distance is halved
– ⇒ m = O(log2 N) hops
• This is achieved without global knowledge: each node
has log2 N state (its finger table) !
At large N we expect each node to be responsible for roughly equal-length arcs → load balancing
Actually, there are two placement methods:
• One based on the RandomPartitioner
– This uses a hash function with a random output in a given range of values (e.g., a random ring position)
– A.k.a. brain-dead load balancing
– In practice, not very good
• One based on the OrderPreservingPartitioner
– This uses an order-preserving hash function (see the sketch below)
– Function h is order-preserving if x <= y ⇒ h(x) <= h(y)
– Great for range queries
– Can you see the problem with this data placement strategy?
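To contrast the two partitioners, here is an illustrative sketch; the token width and byte-padding rule are assumptions, not Cassandra's exact token format. The hash-based token scatters keys uniformly over the ring, while the order-preserving token keeps key order, which is what makes range scans easy but also lets popular key ranges pile up on a few nodes.

```python
import hashlib

TOKEN_BITS = 128
RING = 2 ** TOKEN_BITS

def random_partitioner_token(key):
    # Hashing destroys key order but spreads tokens ~uniformly over the ring.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % RING

def order_preserving_token(key):
    # Treat the key bytes (padded to 16 bytes) as a big integer, so that
    # key1 <= key2 implies token(key1) <= token(key2); keys longer than
    # 16 bytes would need more care -- this is only an illustration.
    return int.from_bytes(key.encode()[:16].ljust(16, b"\x00"), "big")

keys = ["apple", "apricot", "banana"]
op = [order_preserving_token(k) for k in keys]
print(op == sorted(op))                         # True: key order is preserved
print([random_partitioner_token(k) for k in keys])   # tokens scattered over the ring
```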
• Given an item key, how do I find the node responsible for it?
• Different than classic DHTs (i.e., NO finger tables)
– Classic DHTs were designed for highly dynamic P2P networks
• Be able to route from any node to any node without central global knowledge
• Lower cost of maintaining a consistent routing state
Limit the per-node routing state to O(log N) entries in N-node networks
– Cassandra designed for operation in datacentres
• Accrue global knowledge about the existence and location of all nodes
• BUT do so in a decentralized way!
– No master node storing such knowledge
• Uses a gossiping protocol
– Knowledge floods/propagates through the network
– Eventually all nodes become aware of it
– A.k.a. epidemic propagation / epidemic protocols
• Each node picks a position on the ring (its token)
• Tokens are gossiped around the cluster
– Every second, each node exchanges messages with 3 other nodes of the same cluster
• These messages contain info about themselves (e.g., ID, ring position etc.) and about other nodes they have already gossiped with
Eventually, each node knows of
• All other nodes in the system
• AND the arc of the ring they are responsible for
• When a node joins the cluster:
– It reads a config file containing a few seed nodes of the cluster
– These seeds are then used for initial gossiping (see the sketch below)
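A toy sketch of the gossip/epidemic idea under the parameters above (a few peers contacted per round); the message format, per-node versioning and class structure are invented for illustration.

```python
import random

class GossipNode:
    """Sketch of epidemic propagation of (node id, token) state."""

    def __init__(self, node_id, token, seeds=()):
        self.node_id = node_id
        # This node's view of the cluster: node_id -> (token, version).
        self.view = {node_id: (token, 1)}
        self.peers = list(seeds)          # addresses learned from the config file

    def gossip_round(self, cluster):
        # Each round, exchange state with a few peers (3 in Cassandra's case).
        targets = random.sample(self.peers, k=min(3, len(self.peers)))
        for t in targets:
            peer = cluster[t]
            self._merge(peer.view)        # learn what the peer knows
            peer._merge(self.view)        # and tell the peer what we know
            # Learning about new nodes also gives both sides new gossip partners.
            self.peers = [n for n in self.view if n != self.node_id]
            peer.peers = [n for n in peer.view if n != peer.node_id]

    def _merge(self, other_view):
        for node, (token, version) in other_view.items():
            if node not in self.view or self.view[node][1] < version:
                self.view[node] = (token, version)


cluster = {i: GossipNode(i, token=i * 1000, seeds=[0]) for i in range(5)}
for _ in range(3):
    for node in cluster.values():
        node.gossip_round(cluster)
print(len(cluster[4].view))   # 5: node 4 now knows every node and its token
```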
Voluntary departures: Node leaves
This IS the point of consistent hashing !
What about involuntary leaves (failures)? To tolerate failures, we need replication !
Replication in Cassandra: factor of replication N = 3, default replication scheme
Note: Nodes have different capacities!
Note: Nodes possibly belong in different DCs and different racks!
Node failure, replication factor = 3
Note: Replicas help in ensuring High Availability AND in Load Balancing! Neighbours detect failures and recoveries and try to ensure the desired replication factor.
Replication in Cassandra
• # replicas -- configurable
• Various placement schemes:
– SimpleSnitch
• Default
• Rack unaware
• N-1 successive nodes
– RackInferringSnitch
• Infers DC/rack from IP
• 1st replica goes to another rack/datacentre
• 2nd to same datacentre but different Rack
• Can be extended to use files that describe node topology, etc.
– PropertyFileSnitch
• Configured w/ a properties file
• Follows same logic as above
– EC2Snitch
• For use when deployed over Amazon EC2
• Coordinator oversees replication
• Each node knows location of all replicas for all objects
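A sketch of the rack-unaware, SimpleSnitch-style placement described above: the replica set is the node owning the token range plus the next N-1 distinct nodes walking clockwise around the ring. Function and variable names are illustrative.

```python
from bisect import bisect_left

def replica_nodes(key_token, sorted_tokens, token_to_node, n=3):
    """Rack-unaware placement sketch: owner of the token range plus the
    next n-1 distinct nodes clockwise on the ring."""
    distinct_nodes = len(set(token_to_node.values()))
    replicas, i = [], bisect_left(sorted_tokens, key_token)
    while len(replicas) < min(n, distinct_nodes):
        node = token_to_node[sorted_tokens[i % len(sorted_tokens)]]
        if node not in replicas:
            replicas.append(node)
        i += 1
    return replicas

tokens = {10: "A", 30: "B", 55: "C", 80: "D"}
print(replica_nodes(42, sorted(tokens), tokens, n=3))   # ['C', 'D', 'A']
```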
Load Balancing? Virtual Nodes
Note: Blue and yellow nodes are more powerful than the others
So, virtual nodes help with LB in 2 ways:
1. Reducing arc-size inequalities
2. Addressing the node heterogeneity issues
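A sketch of how virtual nodes can be assigned, assuming the number of vnodes per physical node is chosen in proportion to its capacity (the node names and capacities below are invented).

```python
import hashlib
from collections import Counter

RING = 2 ** 32

def vnode_tokens(node_name, num_vnodes):
    # Each virtual node gets its own token, i.e. its own (small) arc on the ring.
    return [int(hashlib.sha1(f"{node_name}#{i}".encode()).hexdigest(), 16) % RING
            for i in range(num_vnodes)]

# Heterogeneous cluster: the powerful nodes take twice as many vnodes (assumed).
ring = {}
for node, capacity in [("blue", 8), ("yellow", 8), ("green", 4), ("red", 4)]:
    for token in vnode_tokens(node, capacity):
        ring[token] = node

# Each physical node now owns many small arcs scattered around the ring, which
# evens out arc-length variance and matches load to capacity.
print(Counter(ring.values()))   # blue/yellow own ~2x as many token ranges as green/red
```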
Data Consistency
• The consistency issue in Cassandra:
– Is the data read the most recent one?
• Recall: there are several replicas
– How can I know if a replica is up-to-date or stale?
– Can I read from any replica?
– Do I need to read from several?
– Is this issue related to how many replicas are updated during writes?
• Such issues are dealt with using a replication control protocol
• Note: For some applications, reading stale data is acceptable!
– Can you think of any such application?
A replication protocol with primary/secondary replicas
[Figure: primary/secondary replication. The client sends Write(X) to the primary, which propagates Write(X) to the secondaries; Ack Write(X) and Commit Write(X) messages complete the protocol.]
What happens if primary goes down?
A replica can serve requests BUT there is NO guarantee this replica is current!
This is the eventual consistency model (BASE)
A replication protocol with majority consensus
Send Write(X) to at least a majority
All replicas are equal
(i.e., no primary/secondary)
Each replica applying a Write(X) increments a local version number
Wait to receive ACKs
From at least a majority of replicas
A write(X) is committed ONLY if a majority of replicas participate
This implies a two-phase protocol
Phase 1: ensure that at least a majority are available; if not, abort the write operation
Phase 2: send writes to at least a majority
A read(X) involving a majority of replicas is certain to see the up-to-date state of X → the one with the latest version! (see the sketch below)
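A sketch of the two-phase, majority-consensus write described above; the Replica class is a toy in-memory stand-in so the example runs, and the interface (is_up, version_of, apply) is an assumption for illustration.

```python
class Replica:
    """Toy in-memory replica used only to exercise the sketch below."""
    def __init__(self, up=True):
        self.up, self.data = up, {}            # key -> (version, value)
    def is_up(self):
        return self.up
    def version_of(self, key):
        return self.data.get(key, (0, None))[0]
    def apply(self, key, value, version):
        self.data[key] = (version, value)
        return True

def majority_write(replicas, key, value):
    """Two-phase majority write: abort if a majority is unreachable,
    commit only if a majority acknowledges the write."""
    majority = len(replicas) // 2 + 1
    # Phase 1: check that at least a majority of replicas are reachable.
    live = [r for r in replicas if r.is_up()]
    if len(live) < majority:
        return False
    # Phase 2: send the write, with a bumped version number, to the live replicas.
    new_version = max(r.version_of(key) for r in live) + 1
    acks = sum(1 for r in live if r.apply(key, value, new_version))
    return acks >= majority

replicas = [Replica(), Replica(), Replica(up=False)]
print(majority_write(replicas, "x", 42))   # True: 2 of 3 replicas form a majority
```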
Consistency vs. Performance
• The enforced consistency level has serious performance implications!
• For READs:
– Reading the replica closest to me incurs the smallest delay
• So if I can READ FROM ANY, it'd be great
– Waiting for just one replica node to return the data is much faster than waiting for several!
• So if I can READ FROM ONE, it'd be great
• For WRITEs:
– Writing to one (or fewer) replicas is much faster than writing to many!
• However, having a primary replica has the disadvantages of:
– Risk of performance bottlenecks
– Reduced consistency
Quorum-based replication
• A generalization of majority consensus
• A read quorum, r, is the number of replicas (r <= N) that must be contacted during a READ
• A write quorum, w, is the number of replicas (w <= N) that must be contacted during a WRITE
• The following conditions must hold:
– r + w > N
→ Guaranteed intersection of all operations at a single replica
→ All READs guaranteed to see the latest version!
• Tunable trade-offs
– For read-intensive objects / apps: r can be smaller (→ w is larger)
– For write-intensive objects / apps: w can be smaller (→ r is larger)
– This guarantees consistency without high performance overheads! (see the sketch below)
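A small sketch of the quorum rule and of how a read over r replicas resolves the latest value; representing replica state as (version, value) pairs is an assumption made for illustration.

```python
def check_quorums(n, r, w):
    # r + w > N guarantees every read quorum intersects every write quorum,
    # so at least one contacted replica holds the latest committed version.
    if r + w <= n:
        raise ValueError("r + w must exceed N for read/write quorums to intersect")

def quorum_read(responses):
    # 'responses' are (version, value) pairs from r replicas; the intersection
    # property means the highest version seen is the latest committed one.
    return max(responses, key=lambda pair: pair[0])[1]

check_quorums(n=3, r=1, w=3)                   # read-one / write-all: read-optimised
check_quorums(n=3, r=2, w=2)                   # classic majority quorums
print(quorum_read([(4, "old"), (5, "new")]))   # "new"
```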
Consistency Levels
Cassandra lets apps decide among the following options.
WRITE Consistency Levels
– ANY: written to at least 1 node (including HH)
– ONE: written to at least 1 replica's commit log and memory table
– QUORUM: written to N/2+1 replicas
– LOCAL_QUORUM: written to N/2+1 replicas within the local D.C. (only with cross-D.C. strategy)
– EACH_QUORUM: written to N/2+1 replicas within each D.C. (only with cross-D.C. strategy)
– ALL: written to all replicas
HH is hinted handoff. If a node A cannot apply a write(x) request due to the unavailability of the node responsible for x, then:
• A stores a note (hint) locally describing the write request, and
• A checks for the responsible node's availability and, when it comes back up, relays the write(x) to it (sketched below).
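A sketch of the hinted-handoff idea just described; send_write, node_is_up and the class structure are invented stand-ins, not Cassandra's actual internals.

```python
import time
from collections import defaultdict

def send_write(node, key, value):
    print(f"write({key}={value}) applied on {node}")   # stand-in for a real RPC

class HintedHandoff:
    """Sketch of hinted handoff as performed by node A."""

    def __init__(self, node_is_up):
        self.node_is_up = node_is_up
        self.hints = defaultdict(list)    # target node -> writes it missed

    def write(self, target_node, key, value):
        if self.node_is_up(target_node):
            send_write(target_node, key, value)
            return
        # Target is down: keep a note (hint) describing the write instead of failing it.
        self.hints[target_node].append((key, value, time.time()))

    def on_node_recovered(self, target_node):
        # When the node comes back up, relay the hinted writes to it.
        for key, value, _hinted_at in self.hints.pop(target_node, []):
            send_write(target_node, key, value)


down = {"nodeB"}
hh = HintedHandoff(node_is_up=lambda n: n not in down)
hh.write("nodeB", "x", 1)          # nodeB is down -> a hint is stored
down.discard("nodeB")
hh.on_node_recovered("nodeB")      # hint replayed: write(x=1) applied on nodeB
```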
READ Consistency Levels
– ANY: not supported
– ONE: returns the record returned by the first replica to respond
– QUORUM: returns the record with the most recent timestamp once at least N/2+1 replicas have reported
– LOCAL_QUORUM: returns the record with the most recent timestamp once at least N/2+1 replicas within the local D.C. have reported
– EACH_QUORUM: returns the record with the most recent timestamp once at least N/2+1 replicas within each D.C. have reported
– ALL: returns the record with the most recent timestamp once all replicas have responded
Cassandra vs. HBase
• Similarities:
– Both instances of LSM trees
• Plus per-file index, Bloom filters, compaction rounds, etc.
– Same (pretty much) data model
– Data partitioned horizontally
• Although not at region boundaries as in BigTable/HBase
• Default partitioner: RandomPartitioner (MD5-hash)
• What happens if we need to change the partitioning for an existing table?
– Data replicated across nodes
• Differences:
– Cassandra generally faster overall
– No single point of failure in Cassandra
(no Master: each node is a possible master)
– Easily supports multi-datacentre deployments
– Somewhat easier to use out of the box (OpsCenter, CQL)