CIS 455/555: Internet and Web Systems
Paxos and BigTable December 7, 2020
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
1
Plan for today
n Transactions
n Consensus
n Paxos
© 2020 A. Haeberlen, Z. Ives, V. Liu
University of Pennsylvania
2
NEXT
The Paxos algorithm
n How Paxos approximately works:
n Three different roles: Proposers, acceptors, learners
n Proposers can propose a new entry to append n Eachproposalhasauniqueversionnumber
n Acceptors accept proposed values
n Ifmultipleproposalsaremadeconcurrently,somenodesmayeven
accept more than one proposal
n But,ifanodehasacceptedaproposalwithversionnumbern,itwillnot accept any further proposals with version numbers smaller than n
n A proposal accepted by a majority of the nodes is decided as the final value, and is learned by the learners
n Note: Each proposal is for a specific entry
n Multiple instances of the protocol can be active concurrently n Example: Propose XYZ as entry 12 and ABC as entry 13
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
3
Phase 1a: Prepare
n Suppose a node A wants to propose a value X for entry n:
n Node A chooses a new version number v
n Node A sends PREPARE(n, v) to all other nodes
n Intuition: PREPARE(n, v) means
n May I make a proposal for entry n with version number v?
n If so, can you suggest a value I should use?
n Note: Fairness is not a goal; even though A is
suggesting X, it is happy with other values too n Why? (Remember what the values are!)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
4
Phase 1b: Promise
n If a node B receives PREPARE(n, v) from A: n If B has already acknowledged a PREPARE(n, v’) with v’>v,
then it does nothing
n If B has previously accepted any proposals for entry n, it responds with PROMISE(n, v, v’, X’), where v’ is the highest version number of any proposal it has accepted for entry n, and X’ is the corresponding value
n Otherwise, B responds with PROMISE(n, v, -, -)
n Intuition: A PROMISE means
n I promise to never accept any proposal with version # < v
n You should choose value X' (if v' and X' are given), or: any value is fine with me (if v' and X' are not given)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
5
Phase 2: Accept
n If A receives PROMISEs from a “quorum” of the other nodes, it issues ACCEPT(n, v, X'')
n X'' is the value from the PROMISE with the highest version number, or the original X if none of the PROMISEs had a value
n If B receives ACCEPT(n, v, X'')
n If B has already responded to a PREPARE(n, v') with v'>v,
then B does nothing
n Otherwise B accepts the proposal and sends ACCEPT(n, v, X”) to all the learners
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
6
Phase 3: Learn
n If a learner L receives ACCEPT(n, v, X”) from a “quorum” of the acceptors, it decides X”
n L then sends DECIDE(n, X”) to all the other learners
n If another learner receives DECIDE(n, X”), it decides X”
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
7
Let’s think about this
n What happens if two nodes concurrently propose different values?
n What is a good lower bound on the size of the quorum?
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
8
Example
Proposers/ Learners
P1 P2
ABCDE
P1: PREPARE(3,5)
A,B,D,E: PROMISE(3,5,-,-)
P1: ACCEPT(3,5,X)
P2: PREPARE(3,8)
A,B,D,E: PROMISE(3,8,-,-)
C: PROMISE(3,8,5,X)
C: Ignore
P2: ACCEPT(3,8,X)
A,..,E: ACCEPTED(3,8,X)
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
9
P1: DECIDE(3,X)
Chubby: Google’s Paxos
n Master handles reads/writes n WritespropagatedthoughPaxos n Readshandledlocally
n A master is elected from the replicas via Paxos
n Masterlease:severalseconds
n Ifmasterfails,anewonewillbeelected,butonlyaftermaster leases expire
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
10
Recap: Paxos
n Goal: Build a replicated service
n Multiple machines acting ‘as if’ they were a single machine n Can mask faults if not too many happen simultaneously
n Paxos implements an important building block: A consistent append-only log
n Useful to make the replicas agree on the order in which to process requests → prevent divergence
n More generally, consensus is useful in many other scenarios n But: Paxos assumes crash faults
n Not the only set of assumptions you can make © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
11
Plan for today
n Consensus n Paxos
n Bigtable
© 2020 A. Haeberlen, Z. Ives, V. Liu
University of Pennsylvania
12
NEXT
© 2020 A. Haeberlen, Z. Ives, V. Liu
© 2020 A. Haeberlen, Z. Ives, V. Liu
© 2020 A. Haeberlen, Z. Ives, V. Liu
© 2020 A. Haeberlen, Z. Ives, V. Liu
© 2020 A. Haeberlen, Z. Ives, V. Liu
© 2020 A. Haeberlen, Z. Ives, V. Liu
© 2020 A. Haeberlen, Z. Ives, V. Liu
Rows
n Name is an arbitrary string
n Access to data in a row is atomic
n Row creation is implicit upon storing data
n Rows ordered lexicographically
n Rows close together lexicographically usually on one or a small
number of machines
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
20
Columns
n Columns have two-level name structure: n family:optional_qualifier
n Column family
n Unit of access control
n Has associated type information
n Qualifier gives unbounded columns n Additional levels of indexing, if desired
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
21
Timestamps
n Used to store different versions of data in a cell
n 64-bits integers
n New writes default to current time, but timestamps for writes can also be set explicitly by clients
n Lookup Options
n “Return most recent K values”
n “Return all values in timestamp range (or all values)”
n Column families can be marked w/ attributes: n “Only retain most recent K values in a cell”
n “Keep values until they are older than K seconds”
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
22
CS 457/557 Introduction to Distributed Systems 27
Tablet & Splitting
n A Bigtable table is partitioned into many tablets based on row keys
n Each tablet is served by one tablet server
“language:”
“contents:”
EN
“…”
“com.aaa” “com.cnn”
“com.cnn/sports.html”
Tablets
“com.website” …
“com.yahoo/kids.html”
Tablet: Start: com.aaa End: com.cnn
…
“com.yahoo/kids.html?d”
… “com.zuppa/menu.html”
© 2020 A. Haeberlen, Z. Ives, V. Liu
Locating Tablets
n Since tablets move around from server to server, given a row, how do clients find the right machine?
n Tabletproperty–startRowIndexandendRowIndex
n Needtofindtabletwhoserowrangecoversthetargetrow
n One approach: could store everything in the master n Centralserverwouldbethebottleneck(evenmorethanGFS)
n Instead: store special tables containing tablet location information in BigTable cell itself
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
24
BigTable Lookups
© 2020 A. Haeberlen, Z. Ives, V. Liu
Each METADATA record ~1KB Max METADATA table = 128MB
Addressable memory (3 tiers) = 221 TB University of Pennsylvania
25
Unsplittable
CS 457/557 Introduction to Distributed Systems 33
BigTable System Architecture
Bigtable cell
Metadata ops
Bigtable client
Bigtable client library
Bigtable master
Bigtable tablet server
serves data
serves data
performs metadata ops,
Open()
load balancing serves data
Read/write
Bigtable tablet server
Bigtable tablet server
© 2020 A. Haeberlen, Z. Ives, V. Liu
Storage
holds tablet data, logs
holds metadata, handles master-election
Paxos
Some services that use Bigtable
Source: Bigtable paper (OSDI 2006)
n At the time, there were 388 non-test Bigtable clusters at Google
n Combined total: 24,500 tablet servers
n Example: Google Analytics
n Raw click table (~200TB): 1 row for each end-user session
n Summary table (~20TB): Predefined summaries per website
© 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
27
Flashback
n Bigtable uses many of the technologies we’ve
been looking at in this course:
n Lock service is made fault-tolerant with Paxos
n Tablet location hierarchy is basically a B+ tree
n Clients can run per-row transactions
n Data is persisted in a scalable file system, GFS
n Bigtable can be used as source or target for MapReduce jobs
n More details are in the paper:
n F. Chang, J. Dean, S. Ghemawat, W. Hsieh, D. Wallach, M. Burrows, T. Chandra, A. Fikes, R. Gruber: “Bigtable: A Distributed Storage System for Structured Data”, OSDI 2006
n http://research.google.com/archive/bigtable-osdi06.pdf © 2020 A. Haeberlen, Z. Ives, V. Liu University of Pennsylvania
28