Brainstorming…
Status quo ante
• Imagine you’re an SRE at BigDataCorp
Lots of customers, lots of transactions, lots of products, lots of interactions…
• Your line manager asks you to design the company’s new database to store/handle everything
What are your functional requirements?
• Semantics?
• Transactions?
• Access patterns?
• Heterogeneous applications?
What are your non-functional requirements?
• High throughput/low latency writes
• High throughput/low latency reads
• Scalability
• Fault tolerance
• High availability
E V E R Y T H I N G !!!
Status quo ante
• What is wrong with centralised relational databases?
• What is wrong with distributed relational databases?
• What is wrong with federated relational databases?
• What is wrong with distributed filesystems (a la HDFS)?
ACID:
• Atomicity
• Consistency
• Isolation
• Durability
pH measures the power/potential (puissance) of Hydrogen atoms in a solution: -log10(c)
• Lower pH → more acidic → ACID: constant motion (checks)
• Higher pH → more basic/alkaline → BASE: lazy constraints
• Brewer’s CAP (theorem)
Pick two of:
• Consistency
• Availability
• Partition-tolerance
Basically available
Soft-state
Eventually consistent
* Source: https://www.dataversity.net/acid-vs-base-the-shifting-ph-of-database-transaction-processing/
Status quo ante
• What does this all mean for the system?
• What does this all mean for the applications?
• When does each of these make sense?
Glossary (cont.)
• Data collection:
• Storage:
• Row store:
• Column store:
• When is each of these the best option?
• How do you scale this out? Partitioning + replication
• How is the data partitioned?
Horizontally?
Vertically?
Both (explain)?
Neither (explain)?
• How do you decide what to replicate where?
Performance?
Load balancing?
Fault tolerance?
Other considerations?
• How do you locate a replica when you need it?
Central “mastermind”?
Distributed directory?
Both (explain)?
Neither (explain)?
Disks are slow for random I/O but fast for sequential I/O
• Sequential I/O up to 3 orders of magnitude faster
• True for all of magnetic/rotational/solid state disks and to some extent also RAM
Sequential disk access can even be faster than random I/O in RAM
• Conclusion: avoid random I/O like the plague!
• Ok, but how?
Writes/insertions?
• Use a log — i.e., append-only approach
Reads/queries?
• How do you find something in a log?
• The disk paradox:
• Finding something in a log… ideas?
Sort the log by key; use binary search
Split the log into partitions by hash of key; only search relevant partition
Use a tree-like structure at the storage level (e.g., B+ tree, ISAM, …)
Use an external index
• What is wrong with these?
Need sorting or some predefined structure. What is wrong with this?
Write throughput now suffers…
External index not that bad
• … but has scalability limitations
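To make the trade-off concrete, here is a minimal Python sketch (not from the slides; names and data are invented). It contrasts an append-only log, where writes are purely sequential but lookups need a full scan, with a sorted run, where binary search is fast but keeping the run sorted on every insert would hurt write throughput.

# Sketch: append-only writes vs. lookups in an unsorted log vs. a sorted run.
import bisect

def append_record(log, key, value):
    # Appending is sequential I/O: we only ever write at the end.
    log.append((key, value))

def lookup_unsorted(log, key):
    # Without any structure, we must scan the whole log: O(n).
    for k, v in reversed(log):          # the newest record for the key wins
        if k == key:
            return v
    return None

def lookup_sorted(sorted_run, key):
    # A sorted run allows O(log n) binary search on the key.
    keys = [k for k, _ in sorted_run]
    i = bisect.bisect_left(keys, key)
    if i < len(keys) and keys[i] == key:
        return sorted_run[i][1]
    return None

log = []
append_record(log, "user42", "v1")
append_record(log, "user17", "v1")
append_record(log, "user42", "v2")
print(lookup_unsorted(log, "user42"))            # v2
print(lookup_sorted(sorted(log), "user17"))      # v1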
Enter the Log-Structured-Merge (LSM) tree
• Batch writes/insertions, save them sequentially in separate (spill) files
No single big index structure
• Sort each such file in memory before dumping to disk
Subsequent reads/searches will be faster
• Never touch a spill file once it is written out
Files are immutable
• Updates though?
Go into new files
Write throughput still high!
• Reads though?
Need to inspect all spill files
• This doesn’t sound good… How can we solve it?
Periodically merge spill files
• Deletes though?
Tombstone records
I.e., every time we delete something, the data size increases !?!
• This doesn’t sound good… How can we solve it?
Periodically merge spill files…
• Done yet?
Not so fast… we may still need to go through many spill files… (a small sketch of the whole LSM idea follows below)
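A toy, in-memory sketch of the LSM idea just described: a mutable memtable, immutable sorted spill runs, tombstone deletes, and a periodic merge. The class name, spill threshold, and data are invented for illustration; real LSM engines (BigTable/HBase, LevelDB, …) persist the runs to disk.

# Minimal in-memory LSM sketch (illustrative only, not BigTable/HBase code).
TOMBSTONE = object()

class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}           # recent writes, mutable
        self.runs = []               # sorted, immutable runs (newest last)
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.memtable[key] = value   # writes never touch existing runs
        if len(self.memtable) >= self.memtable_limit:
            self._spill()

    def delete(self, key):
        self.put(key, TOMBSTONE)     # deletes are just special writes

    def get(self, key):
        # Newest data first: memtable, then runs from newest to oldest.
        if key in self.memtable:
            v = self.memtable[key]
            return None if v is TOMBSTONE else v
        for run in reversed(self.runs):
            d = dict(run)
            if key in d:
                v = d[key]
                return None if v is TOMBSTONE else v
        return None

    def _spill(self):
        # Sort in memory, write out an immutable run, start a fresh memtable.
        self.runs.append(sorted(self.memtable.items()))
        self.memtable = {}

    def compact(self):
        # Merge all runs; newer runs win; drop tombstones at the end.
        merged = {}
        for run in self.runs:        # oldest to newest
            merged.update(run)
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
        self.runs = [sorted(merged.items())]

db = TinyLSM()
for i in range(10):
    db.put(f"k{i}", i)
db.delete("k3")
print(db.get("k2"), db.get("k3"))    # 2 None
db.compact()
print(len(db.runs))                  # 1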
Reading from a LSM tree
• How can we tell if we even need to scan a spill file or not?
• Use a Bloom Filter
A space-efficient structure for representing sets
Ops: add element, check element for membership (note: standard BFs cannot safely remove elements; counting BFs can)
A set, S, of size n is represented by a bitmap, B, of size m
• B initialized to all-0
To add element e into S:
• i ← hash(e), with i in [1, m]
• Set the B[i] bit
To check if e is a member of S:
• i ← hash(e)
• If B[i] == 1, then answer YES
• Else, answer NO.
• Note: false positives (WHY ?)
Fundamentally with BFs one trades accuracy (false positives) with space requirements
For many apps, false positives are acceptable !
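A minimal Bloom filter sketch in Python, assuming k positions are derived from a single SHA-256 digest (real systems use cheaper hash functions). It supports add and membership checks only, with possible false positives and no false negatives.

# Small Bloom filter sketch (illustrative, not the BigTable implementation).
import hashlib

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)         # one byte per bit, for simplicity

    def _positions(self, item):
        # Derive k positions in [0, m) from a single SHA-256 digest.
        digest = hashlib.sha256(item.encode()).digest()
        for i in range(self.k):
            chunk = digest[4 * i: 4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def might_contain(self, item):
        # No false negatives; false positives possible.
        return all(self.bits[pos] for pos in self._positions(item))

bf = BloomFilter(m=6224, k=5)
for key in ("alice", "bob", "carol"):
    bf.add(key)
print(bf.might_contain("alice"))     # True
print(bf.might_contain("mallory"))   # almost certainly False (small chance of a false positive)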
Bloom Filters Overview
• To reduce the prob. of false positives: use more than one hash function!
Add: set more than one bit of B (one per hash function)
Member: check all of those bits of B
• Parameters:
Bitmap size, m (vs. the set size n)
k, the number of hash functions used
Bloom Filters Overview
• Probability of false positives:
What is the probability that a bit is 0 after adding one element and using just one hash function?
• p1 = 1 - 1/m   [assume all bits have the same probability of being hit]
What if we use k hash functions?
• p2 = (1 - 1/m)^k
So, what is the probability that a bit is zero after we have added n elements using k hash functions?
• p3 = (1 - 1/m)^(kn)
So, the probability of a bit being 1 after adding n elements with k hash functions is:
• p4 = 1 - (1 - 1/m)^(kn)
So, what is the probability of a false positive, pfp, after we have added n elements?
• i.e., we check for an element that is NOT in the BF and the BF incorrectly answers YES
• Since we use k hash functions, we check whether k bits are set:
• pfp = (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k
Actually, this last step was later found to be not strictly correct (the bitmap values are not independent), but it is still a good approximation.
Optimal k, given m, n: kopt = (m/n) * ln(2)
Optimal m, given n, target pfp: mopt = -1.44 * log2(pfp) * n
• Optimal k then: kopt = -log2(pfp)
• E.g., for n = 1,000 and a target pfp = 0.05 (5%): kopt = 5, mopt = 6,224 bits = 778 bytes
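A quick sanity check of these formulas in Python, reproducing the n = 1,000, pfp = 0.05 example (the function names are mine):

# Checking the slide's formulas and the n = 1,000, pfp = 0.05 example.
import math

def false_positive_rate(m, n, k):
    return (1 - math.exp(-k * n / m)) ** k        # (1 - e^(-kn/m))^k

def optimal_m(n, p_fp):
    return math.ceil(-1.44 * math.log2(p_fp) * n)

def optimal_k(m, n):
    return round((m / n) * math.log(2))

n, p_fp = 1000, 0.05
m = optimal_m(n, p_fp)        # 6,224 bits (= 778 bytes)
k = optimal_k(m, n)           # about 4-5 hash functions
print(m, k, false_positive_rate(m, n, k))   # pfp comes out close to 0.05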
BigTable/HBase
• What type of data store is BigTable?
Row store
Column store
Both (explain)
Neither (explain)
BigTable/HBase components
• Master node
A la NameNode — HMaster for HBase
• Tablet server nodes
A la DataNodes — RegionServers for HBase
• Client libraries
• Rings a bell?
Chubby: a distributed locking service (Zookeeper for HBase)
• GoogleFS for storage (or HDFS for HBase)
Keep this in mind!
Data model
• A sparse, distributed, persistent, multidimensional, sorted map
• Provides a table-like abstraction but it is not a RDBMS
No transactions
No normalization
No relational API
No proper schema
• Users define table name, column families; everything else is fair game!
Goal: allow clients to know and exploit the physical localities of their data!
• Data addressable by:
Rowkey (required)
Column name
Timestamp
A 3-d map
• {rowkey, column name, timestamp} → value
A key-value store
• key: {rowkey, column name, timestamp}, value: value
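A plain-Python illustration of this map abstraction: keys are (rowkey, column name, timestamp) triples, values are the cell contents. The data mirrors the toy table used later in these slides.

# The "sparse, multidimensional, sorted map" viewed as a plain Python dict.
cells = {
    ("1", "X:col1", 1): "A",
    ("1", "X:col2", 1): "α",
    ("2", "X:col1", 1): "B",
    ("2", "Y:col3", 1): "€",
}

# Addressing a single cell:
print(cells[("2", "Y:col3", 1)])                 # €

# "Latest version by default": pick the cell with the largest timestamp.
def latest(cells, rowkey, column):
    versions = {ts: v for (r, c, ts), v in cells.items()
                if r == rowkey and c == column}
    return versions[max(versions)] if versions else None

print(latest(cells, "1", "X:col1"))              # A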
Data model
• Each table cell value is just an (uninterpreted) byte array (string)
This includes rowkeys and column qualifiers
• Row keys sorted lexicographically
• Columns are grouped into column families
Column name = “CF name:qualifier” (a.k.a. column key)
Small number (<10) of possibly very wide CFs (millions of columns)
Column family names must consist of printable chars (why?)
• Data within a single CF stored together on disk
Actually, each CF has its own directory (e.g., /table/CFx/)
Allows for grouping of columns frequently accessed together
Improves performance when accessing data
A column store?
Data model
• Each table cell may have multiple versions
Hence the need for a timestamp
A TS can be actual time or some user-defined quantity
Latest version of each cell returned by default
• Users can also specify
Explicit no. of max versions for the system to keep
Explicit max age (e.g. one week)
Allows for a sort of “garbage collection”
• No such thing as a “row” in the traditional sense
A row is simply the set of all cells of a given table sharing the same rowkey
Within each row, cells sorted lexicographically on column qualifiers
Within each set of cells with the same col. qual., cells sorted numerically by timestamp
Within a CF, all data for a row stored together on disk
A row store?
{1,c1,ts}: A
{1,c2,ts}: α
{1,c3,ts}: £
{2,c1,ts}: B
{2,c2,ts}: β
{2,c3,ts}: €
{3,c1,ts}: B
{3,c2,ts}: β
{3,c3,ts}: €
Data model
• So how do we scale this?
Partitioning
Replication
• Replication taken care of by GoogleFS/HDFS
More on this later
• What’s the best way to partition a large table across our servers?
Per column family (a set of CFs per server)
Per column (a set of columns per server)
Per row (a set of rowkeys per server)
Other (explain)?
Data model
• A table is partitioned (split) horizontally into tablets (regions)
A tablet/region contains all rows in between a start rowkey and an end rowkey
• Each tablet/region is assigned to (stored and managed by) a tablet/region server
• Tablets/regions are the “unit of distribution and load balancing”
Data model
• Data collection:
• Key-value representation:
{1,X:col1, ts}:A
{1,X:col2, ts}:α
{1,Y:col3, ts}:£
{2,X:col1, ts}:B
{2,X:col2, ts}:β
{2,Y:col3, ts}:€
{3,X:col1, ts}:B
{3,X:col2, ts}:β
{3,Y:col3, ts}:€
• Assume two regions: [1-2], [3-]
1st region:
{1,X:col1, ts}:A
{1,X:col2, ts}:α
{1,Y:col3, ts}:£
{2,X:col1, ts}:B
{2,X:col2, ts}:β
{2,Y:col3, ts}:€
2nd region:
{3,X:col1, ts}:B
{3,X:col2, ts}:β
{3,Y:col3, ts}:€
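A small sketch of the region assignment just shown: given the regions' start rowkeys, each row is routed to the rightmost region whose start key is <= the rowkey. The boundaries here are the toy [1-2], [3-] split above; real systems compare byte strings.

# Horizontal partitioning by rowkey ranges (boundaries invented for the example).
import bisect

region_starts = ["1", "3"]          # region 0: [1, 3), region 1: [3, ∞)

def region_for(rowkey):
    # Rightmost region whose start key is <= rowkey.
    return bisect.bisect_right(region_starts, rowkey) - 1

for row in ["1", "2", "3"]:
    print(row, "-> region", region_for(row))
# 1 -> region 0, 2 -> region 0, 3 -> region 1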
HBase Internals
Data model: Example
• Rowkey: URL
Why is row key ... “reversed”?
• 2 column families
CF “contents”: no column qualifier
• The value is the html page contents
• Up to 3 versions
Note the required schema flexibility, as more sites/links pointing to CNN emerge ...
CF “anchor”: column qualifier is the URL of the anchor
• Value is the anchor text
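A tiny illustration (hostnames invented) of why the row key is “reversed”: after reversal, pages from the same domain sort next to each other lexicographically, so they end up in the same or adjacent tablets.

# Reversed URL row keys: com.cnn.www instead of www.cnn.com.
def reverse_host(url_host):
    return ".".join(reversed(url_host.split(".")))

hosts = ["www.cnn.com", "money.cnn.com", "www.bbc.co.uk"]
print(sorted(hosts))                              # cnn.com pages are scattered
print(sorted(reverse_host(h) for h in hosts))     # com.cnn.* pages are adjacent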
HBase Internals
Data model
Put(rowkey, column_key, timestamp, value)
Get(rowkey)
• Returns all cells of row
Get(rowkey, column_key, timestamp)
• Returns a specific cell
Get(List<Get>)
• Returns a list of results, one per Get operation in the input list
• A.k.a. a multi-get
Scan(start_rowkey, end_rowkey)
• Returns all rows with keys in [start_key, end_key)
Delete(rowkey)
Delete(rowkey, timestamp)
Delete(rowkey, column_family [,timestamp])
Delete(rowkey, column_key [,timestamp]) …
• Basic OPs
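A hedged sketch of these basic OPs from Python, assuming the third-party happybase client, a running HBase Thrift gateway on localhost, and an existing table 'webtable' with a column family 'anchor' (all names are illustrative, not part of the slides):

# Basic OPs via the happybase client (assumes HBase Thrift server on localhost).
import happybase

conn = happybase.Connection("localhost")     # connects via the Thrift gateway
table = conn.table("webtable")

# Put: timestamps default to "now"
table.put(b"com.cnn.www", {b"anchor:cnnsi.com": b"CNN"})

# Get(rowkey): all cells of the row (latest versions)
print(table.row(b"com.cnn.www"))

# Get a specific column, with multiple versions
print(table.cells(b"com.cnn.www", b"anchor:cnnsi.com", versions=3))

# Multi-get
print(table.rows([b"com.cnn.www", b"org.example.www"]))

# Scan(start_rowkey, end_rowkey): keys in [start, stop)
for key, data in table.scan(row_start=b"com.cnn", row_stop=b"com.cno"):
    print(key, data)

# Delete a column of a row
table.delete(b"com.cnn.www", columns=[b"anchor:cnnsi.com"])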
Data model
• Put(rowkey, column_key, timestamp, value)
• Get(rowkey)
• Get(rowkey, column_key, timestamp)
• Get(List<Get>)
• Scan(start_rowkey, end_rowkey)
Data model
• So… we can do selections and projections
• How about:
Get all values for a column?
Order by?
Group by?
Transactions?
• Single-row?
• Multi-row?
• Put(rowkey, column_key, ts, value)
• Get(rowkey)
• Get(rowkey, column_key, ts)
• Get(List<Get>)
• Scan(start_rowkey, end_rowkey)
BigTable/HBase Architecture
Architecture: Chubby/Zookeeper
A distributed service, which could be used for:
Locking services – implement consistency/isolation among operations
Stores key global data (e.g., the schemas of tables, tabletservers, …)
Access control (authorization for reads and writes)
Stores the “bootstrap” info for BigTable data
Used by the master to detect new tablet servers and to deal with the ones that must ‘die’
Highly available
Consists of 5 replicas
Chubby is live when a majority are up and can communicate with each other
One replica is MASTER, which actively serves client write requests
Use of quorums (among an odd number of servers)
PAXOS algorithm for consistency in the face of failures
Can elect a new MASTER, etc.
Architecture: Master and Tablet Servers
Creates a new tablet(region) / deletes a tablet(region) / merges two tablets(regions)
Assigns tablets to tablet servers
Detects new or ‘expired’ tablet servers
Load balancing among tablet servers
Schema management (e.g., additions/deletions of a table, CF, or column…)
Master is NOT part of client-server data flows
Think: NameNode, JobTracker, ResourceManager, etc.
TabletServers/RegionServers:
Store and manage tablets
10s to 1000s of tablets per tablet server
Serve requests to read and write their tablets
Split tablets as they grow (e.g., > 100-200 MB)
Originally, there is one tablet per table
Tablet servers can be added/removed dynamically
Note: TabletServer splits tablets, while the Master merges/creates/deletes them
How are tablets located?
• A 3-level hierarchy for tablet location:
1. A Chubby file stores the location of the Root tablet
2. The Root tablet (the 1st METADATA tablet) contains the locations of all the other METADATA tablets
3. The METADATA table contains the location of ALL user-table tablets: each row stores the location of one user-table tablet, keyed by (table id, end rowkey)
• With 128MB tablets, this can map 2^34 user table tablets
[Figure 4: Tablet location hierarchy: Chubby file → Root tablet (1st METADATA tablet) → other METADATA tablets → UserTable1 … UserTableN]
How are tablets located ?
• Each client needs to access
1. The Chubby file (to find out the location of ROOT tablet)
2. The ROOT tablet
3. The appropriate metadata tablet
4. The actual user table tablet
3 network round trips to locate a tablet!
• How can we alleviate this? Caching!
The client library caches tablet location in memory, so
to reuse results from previous searches
The first two steps are cache hits w.h.p.
If cache is stale, up to 6 net round trips may be required!
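A toy model of the 3-level lookup with client-side caching (my sketch, not the real client library; a real cache is keyed by tablet ranges rather than individual rowkeys, and the stand-in lookup functions are invented):

# Tablet location lookup with a client-side cache.
location_cache = {}     # rowkey -> tablet location, filled lazily

def locate_tablet(rowkey, read_chubby, read_root, read_metadata):
    # Each read_* callable stands in for one network round trip.
    if rowkey in location_cache:                      # cache hit: zero round trips
        return location_cache[rowkey]
    root_location = read_chubby()                     # 1st round trip
    meta_tablet = read_root(root_location, rowkey)    # 2nd round trip
    user_tablet = read_metadata(meta_tablet, rowkey)  # 3rd round trip
    location_cache[rowkey] = user_tablet
    return user_tablet

# Example with trivial stand-ins for the three lookups:
server = locate_tablet(
    "com.cnn.www",
    read_chubby=lambda: "root@ts1",
    read_root=lambda root, key: "meta-tablet-7@ts2",
    read_metadata=lambda meta, key: "user-tablet-42@ts9",
)
print(server)                                         # 3 round trips the first time
print(locate_tablet("com.cnn.www",                    # cached: no round trips now
                    lambda: None, lambda *a: None, lambda *a: None))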
Locating Data in BigTable
• Locate which tablet server(s) must be contacted, using the tablet location hierarchy (Chubby file → Root tablet (1st METADATA tablet) → other METADATA tablets → user table tablets such as UserTable1 … UserTableN)
• Use BFs to locate which SSTables at that tablet server hold relevant data
• Use the index block within the selected SSTables to locate the blocks that must be accessed
How are tablets assigned to tablet servers?
Master keeps track of:
Tablet servers and their assigned tablets
Any unassigned tablets
Master assigns unassigned tablets to servers with room
Tablet server life cycle (via Chubby):
On start up: create a lock file in the SERVERS DIR and eXclusively lock it
Kills itself if it loses the lock (otherwise tries to re-acquire the X lock)
On termination: release the X lock
The Master monitors the SERVERS DIR
How are tablets assigned to tablet servers?
The Master periodically asks each tablet server: “status of lock?”
If there is no response, or if the TS reports it has lost its lock, the Master asks Chubby: “give me the X lock”
If the Master succeeds to get the X lock, then it is certain that either:
• TS is down, or
• TS is unable to talk to Chubby
In both cases, the TS will never serve the tablet again
So the Master:
• Deletes the X-lock file from the SERVERS DIR and
• Marks the tablets of this TS as “unassigned”
Master StartUp
• Started by the cluster management system
• Master needs to find
current tablet assignment and
deal with any unassigned tablets
1. Get a unique MASTER LOCK in Chubby
2. Read the SERVERS DIR to find all TSs
3. Talk to all TSs to find all assigned tablets
4. Scan the METADATA Table to find all tablets, and thus any unassigned tablets
Writing and Reading Tablets
• When TS receives a write op:
1. TS will check Chubby file of authorized writers for authorization (typically in TS cache)
2. The update record is then written into the commit log (for recovery)
3. Then, the write is applied to the memtable (for performance)
• When TS receives a read op:
1. Check Chubby file for authorization
2. Perform read on a “merged view” from SSTables and memtable
3. Recall: SSTables and memtable are both lex-sorted structures
so this merged view can be constructed efficiently (why? both structures are sorted by key, so a merge-sort-style merge suffices; see the sketch below)
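A sketch of why the merged view is cheap to construct: since the memtable and every SSTable are sorted by key, a k-way merge (heapq.merge in Python) streams their union in key order without re-sorting. The data is invented; in a real system the newest source wins on duplicate keys.

# Building a "merged view" over sorted sources with a k-way merge.
import heapq

memtable = [("k1", "v1-new"), ("k4", "v4")]
sstable1 = [("k1", "v1-old"), ("k2", "v2")]
sstable2 = [("k3", "v3"), ("k5", "v5")]

# heapq.merge streams the union in sorted key order without re-sorting.
for key, value in heapq.merge(memtable, sstable1, sstable2):
    print(key, value)
# k1 v1-new / k1 v1-old / k2 v2 / k3 v3 / k4 v4 / k5 v5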