COMP5338 – Advanced Data Models
Dr. Ying Zhou
School of Information Technologies
COMP5338 – Advanced Data Models
Week 6: Google Spanner
Outline
Motivation
Structure and Data Model
Distributed Query Execution
Motivation
Requirements for cross-datacenter replication
The initial use case is the back end of Google’s advertising services
Data are stored in 5 replicas across 3 or 5 data centers in the USA
Can Bigtable structure support such a scale?
Does the underlying file system scale?
Would single master be a bottleneck?
The limitations of Bigtable
It is not designed as a general-purpose storage system and can be
difficult to use for some kinds of applications, especially OLTP
applications
Motivation (cont’d)
An initial solution was to build a semi-relational data model on top of
Bigtable (Megastore)
The performance of Megastore is not ideal
It still “lacked many traditional database features that application developers
often rely on. A key example is a robust query language, meaning that
developers had to write complex code to process and aggregate the data in
their applications.”
Used by many well-known Google applications: Gmail, Picasa, Calendar, etc.
Spanner evolved from a Bigtable-like versioned key-value store into a
temporal multi-version database
Data is stored in schematized semi-relational tables
Data version is automatically timestamped with its commit time
It provides a SQL-based query language and supports general-purpose
long-lived transactions.
Outline
Motivation
Structure and Data Model
Distributed Query Execution
Spanner Structure
A Spanner deployment is called a universe
Universe master and placement driver
Both are singletons
A universe consists of many zones
A zone is the rough analogue of a Bigtable cluster
Zone Structure
A zone is the rough analogue of a Bigtable cluster
A zone consists of
Many spanservers
Serve data to clients
One zonemaster
Allocates data to spanservers
Location proxies
Help clients locate the spanservers assigned to their data
Each spanserver manages between 100 and 1000 tablets
A tablet is similar to Bigtable’s tablet abstraction
It stores versioned data of the format
(key:string, timestamp:int64) → string
Actual data files and logs are stored on the append-only Colossus file
system (the successor of GFS)
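A minimal sketch (in Python, illustrative only, not Spanner’s actual interface) of the versioned key-value mapping a tablet maintains:

# Illustrative sketch of a tablet's versioned key-value mapping:
# (key: string, timestamp: int64) -> string.  Names are made up; real
# tablets store B-tree-like files and a write-ahead log on Colossus.
class Tablet:
    def __init__(self):
        self.data = {}                      # (key, timestamp) -> value

    def write(self, key, timestamp, value):
        self.data[(key, timestamp)] = value

    def read(self, key, at_timestamp):
        """Return the latest value of `key` with timestamp <= at_timestamp."""
        versions = [(ts, v) for (k, ts), v in self.data.items()
                    if k == key and ts <= at_timestamp]
        return max(versions)[1] if versions else None

t = Tablet()
t.write("user1#album1", 100, "beach photos")
t.write("user1#album1", 200, "beach photos (renamed)")
print(t.read("user1#album1", 150))          # -> "beach photos"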
Data Replication
Two levels of replication
Locally within the zone
Not managed by Spanner
Across zones
Managed by Spanner
Universe master
Primarily a console to monitor zone status
Placement driver
Handles data movement across zones
Load balancing
Satisfying replication constraints
Spanner Software Stack
The Paxos algorithm is used to support replication
The set of all replicas of a tablet forms a Paxos group
The leader of the group uses a lock mechanism to control concurrent writes and
acts as the transaction manager
There are many Paxos groups in the whole universe
(Figure: software stack of a zone; the replicas of a tablet, one in each zone, form a Paxos group.)
Coordination activities involved
A universe is expected to receive many transactions
Concurrent transactions need to be executed in a consistent
order on all replicas involved
Replica 1: transaction a, b, c
Replica 2: transaction a, b, c
Activities involved:
Ensure consensus on the transaction order among all replicas
Execute the transactions according to the agreed order
Paxos algorithm is used to ensure consensus on the order
The replicas form a Paxos group
A lock mechanism is used to control concurrent execution
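A small sketch of why the agreed order matters; it does not implement Paxos itself, but shows that replicas applying the same transaction log in the same order converge to the same state:

# Sketch: once Paxos has fixed a single log order, every replica that applies
# that log ends up in the same state.  (The agreed order is given as input.)
def apply_log(agreed_order):
    state = {}
    for txn in agreed_order:        # each txn is a list of (key, value) writes
        for key, value in txn:
            state[key] = value
    return state

txn_a = [("x", 1)]
txn_b = [("x", 2), ("y", 7)]
txn_c = [("y", 9)]

agreed = [txn_a, txn_b, txn_c]      # the order the Paxos group decided on
replica1 = apply_log(agreed)
replica2 = apply_log(agreed)
assert replica1 == replica2 == {"x": 2, "y": 9}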
Paxos algorithm
The name Paxos is from the original paper “The Part-time Parliament”
by Leslie Lamport
It refers to a Greek Island
The paper describes a parliament whose legislators constantly wander in
and out of the parliamentary chamber
Each legislator keeps a ledger recording the sequence of decrees passed
E.g.
155: The olive tax is 3 drachmas per ton
132: Lamps must use only olive oil
The challenge is to ensure consistency of all ledgers while legislators
wander in and out
The consensus algorithm proposed is called Paxos
Determining which transaction to run at which point fits this scenario
perfectly
Coordination activities involved (cont’d)
Each transaction may contain several queries
They should satisfy the basic ACID requirements
Transactions within the same Paxos group do not need extra
coordination
Consistency of the replicas is guaranteed by Paxos
Transactions across Paxos groups are coordinated by two-phase
commit
Each involved group’s participant leader joins the protocol
Data consistency within each zone is managed by the
underlying file system
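A highly simplified sketch of the two-phase commit idea among participant leaders; the real protocol additionally logs prepare and commit decisions through Paxos in each group, which is omitted here:

# Sketch of two-phase commit across groups (names and behaviour simplified).
class GroupLeader:
    def __init__(self, name):
        self.name = name

    def prepare(self, txn):
        # e.g. acquire locks, write a prepare record; return False to abort
        return True

    def commit(self, txn):
        print(f"{self.name}: commit {txn}")

    def abort(self, txn):
        print(f"{self.name}: abort {txn}")

def two_phase_commit(coordinator, participants, txn):
    leaders = [coordinator] + participants
    # Phase 1: every participant leader must vote to prepare
    if all(leader.prepare(txn) for leader in leaders):
        for leader in leaders:          # Phase 2: commit everywhere
            leader.commit(txn)
        return True
    for leader in leaders:              # otherwise abort everywhere
        leader.abort(txn)
    return False

two_phase_commit(GroupLeader("group-A"), [GroupLeader("group-B")], "txn-42")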
Directory and Placement
Data in a tablet are organized into directories
A directory is a bucket-like abstraction representing a set of
contiguous keys in the tablet that share a common prefix
A directory is the unit of data placement
A directory is also the unit of data movement between
Paxos groups
A Spanner tablet is therefore quite different from a Bigtable tablet
E.g. dir1 and dir2 belong to a tablet that is currently replicated on zones 1, 2 and 3; dir2 may be moved to another tablet replicated on zones 2, 3 and 4
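An illustrative sketch (keys and zone assignments are made up) of directories as the unit of placement and movement: keys sharing a prefix form a directory, and a directory moves between tablets as a whole:

# Sketch: a directory is the set of contiguous keys sharing a common prefix,
# and it moves between tablets/Paxos groups as a unit.
from collections import defaultdict

def group_into_directories(keys, prefix_of):
    dirs = defaultdict(list)
    for key in sorted(keys):
        dirs[prefix_of(key)].append(key)
    return dict(dirs)

keys = ["user1#album1", "user1#album2", "user2#album1", "user3#album1"]
dirs = group_into_directories(keys, prefix_of=lambda k: k.split("#")[0])
# {'user1': ['user1#album1', 'user1#album2'], 'user2': [...], 'user3': [...]}

tablet_on_zones_123 = {"user1": dirs["user1"], "user2": dirs["user2"]}
tablet_on_zones_234 = {"user3": dirs["user3"]}

# Moving directory 'user2' to the other tablet moves all of its keys together:
tablet_on_zones_234["user2"] = tablet_on_zones_123.pop("user2")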
Revisit: data partition in MongoDB
Each chunk contains a contiguous range of sharding key values (or the hash of it)
The chunks belonging to a shard do not have to be next to each other in terms of sharding key space
Each shard stores a number of chunks
Each shard does not manage a contiguous range of sharding key values (or the hash of it)
Chunks can move around shards
Each shard is a replica set
Revisit: data partition in Bigtable
(Figure: the row key space, from min to max value, is divided into tablets tablet1..tabletn, which are assigned to tablet servers.)
Each tablet contains a contiguous range of row key values
Each tablet server manages a number of tablets
Bigtable tablets may vary in size, while MongoDB chunks have a fixed maximum size
Each tablet server does not manage a contiguous range of row keys
A tablet may change its managing server, but that does not necessarily result in data movement
Tablets may merge or split
Replication is managed by the underlying GFS
Spanner data partition
(Figure: the key space, from min to max value, is divided into directories dir1..dirn; directories are grouped into tablets tablet1..tabletk, and each tablet is replicated to several zones.)
Each directory contains a contiguous range of keys sharing a common prefix
Each tablet contains a number of directories, which are not necessarily next to each other
Each tablet does not contain a contiguous range of keys
Each tablet is replicated to a number of zones; this replication is managed by Spanner
Within a zone, replication is managed by the underlying file system: Colossus
Data Model
Each table is required to have one or more columns
specified as its primary key
Each table defines a mapping from the primary-key columns
to the other columns
(primary-key:string, timestamp:int64) → other columns
Tables can be organized into hierarchies, defined by the client when
creating the tables
The hierarchy determines how key-value pairs are grouped into directories
and tablets
A Spanner tablet may contain data from more than one table
Data Model Example
Schema for storing photo
metadata on a per-user,
per-album basis
The Users table is declared
as the parent table
The Albums table is declared as
the child table
The child table is co-located
with the parent table
Similar to pre-joined tables in
some RDBMSs
The child table includes the
parent table’s key as a prefix of its
primary key
(Figure: interleaved Users and Albums rows; all rows whose keys share the same user-key prefix, i.e. a common prefix, form one directory.)
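A small sketch (with made-up keys) of how interleaving by a shared key prefix co-locates each user’s Albums rows with the corresponding Users row:

# Sketch of how interleaving lays out parent and child rows: Albums rows carry
# the Users key as a prefix, so sorting all rows by key co-locates each user's
# albums with the user row (keys shown are illustrative).
users  = {(1,): "alice", (2,): "bob"}
albums = {(1, 10): "holiday", (1, 11): "pets", (2, 10): "food"}

rows = [(key, "Users", val) for key, val in users.items()] + \
       [(key, "Albums", val) for key, val in albums.items()]

for key, table, value in sorted(rows):
    print(key, table, value)
# (1,)    Users  alice      <- directory for user 1 starts here
# (1, 10) Albums holiday
# (1, 11) Albums pets
# (2,)    Users  bob        <- directory for user 2
# (2, 10) Albums food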
Outline
Motivation
Structure and Data Model
Distributed Query Execution
Query Plan Generation
General process
The query compiler first transforms an input query into a relational
algebra tree
Based on the actual schema and data properties, the optimizing
compiler rewrites the initial tree into an efficient plan via
transformation rules
Well-known transformations for local optimization
Spanner-specific transformations for distributed queries
General principles for distributed query plan
Always do local operations first (in parallel), then merge the results
Several explicit distribution operators
A Distributed Union operator is used to ship a subquery to each
shard, and to concatenate the results
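A toy sketch of the Distributed Union idea, running the same subquery on every shard and concatenating the results in key order; the shard layout and filter are made up:

# Sketch of Distributed Union: run the same subquery on every shard and
# concatenate the per-shard results in table key order (illustrative only).
def distributed_union(shards, subquery):
    results = []
    for shard in shards:                 # shards kept in key order
        results.extend(subquery(shard))  # in reality shipped to remote servers
    return results

shards = [
    [{"ckey": 1, "amount": 30}, {"ckey": 2, "amount": 5}],   # shard 1
    [{"ckey": 7, "amount": 12}],                              # shard 2
]
over_10 = distributed_union(shards, lambda rows: [r for r in rows if r["amount"] > 10])
# -> matching rows from every shard, concatenated in key order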
Distributed query compilation
The Customer table is sharded on ckey, and the Sales
table is interleaved in Customer and
sharded by the same key
The query returns the top 5 customers from
the list customer_key_arr
by total sum of sales of a particular kind
‘global’
The interleaved layout of rows:
Customer(1)
Sales(1, ‘a’)
Sales(1, ‘b’)
Sales(1, ‘c’)
Customer(2)
Sales(2, ‘b’)
Sales(2, ‘c’)
General Process
Put a Distributed Union operator at the bottom of the tree,
above every table in the query
Pull up this operator as much as possible
This is similar to pushing other operators down as far as possible
Any operators pushed below a Distributed Union should
satisfy a property called partitionability
An operator F satisfying partitionability means that performing an
ordered union of the results of applying F to each shard in table key
order gives the same outcome as applying F to the results of a global
scan
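A small check of the partitionability property under these assumptions (toy shards already in key order): a filter is partitionable, while a plain top-1 is not:

# Sketch of partitionability: applying F per shard and taking the ordered
# union must equal applying F to a global scan (illustrative only).
def ordered_union(per_shard_results):
    return [row for shard_rows in per_shard_results for row in shard_rows]

def is_partitionable(F, shards):
    global_scan = [row for shard in shards for row in shard]
    return ordered_union([F(shard) for shard in shards]) == F(global_scan)

shards = [[1, 4, 6], [7, 9, 12]]                 # already in key order
filter_even = lambda rows: [r for r in rows if r % 2 == 0]
top1        = lambda rows: [max(rows)]

print(is_partitionable(filter_even, shards))     # True: filters can be pushed down
print(is_partitionable(top1, shards))            # False: TopN needs a merge step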
Operators that can be pushed down
Basic operations like projection and filtering below
Distributed Union
Joins between interleaved tables if the join key is the
sharding key
Those tables are co-located based on the sharding key
Operators that can be partially processed locally
E.g. a global TopN can be obtained by computing local TopNs on
each shard and then merging them
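A sketch of that TopN rewrite: each shard returns only its local TopN, and a final merge produces the global TopN (heapq is used here purely for illustration):

# Sketch: global Top-N from per-shard local Top-Ns.  Each shard only needs to
# return its own N largest values; a final merge picks the overall Top-N.
import heapq

def global_top_n(shards, n):
    local_tops = [heapq.nlargest(n, shard) for shard in shards]   # pushed down
    return heapq.nlargest(n, [v for top in local_tops for v in top])

shards = [[5, 42, 17], [99, 3], [23, 61, 8]]
print(global_top_n(shards, 2))        # [99, 61]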
Execution Plan
CrossApply(input, map) evaluates ‘map’ for
every tuple from ‘input’ and concatenates the
results of ‘map’ evaluations.
Distributed Execution
At runtime, Spanner needs to work out where to send the
subqueries: the shards and the servers managing the
shards
Note: Google uses shard and tablet interchangeably in the 2012 paper,
and only shard in the 2017 paper.
This has a slightly different meaning from a shard in MongoDB
If the query expression contains filtering based on the
sharding key
This filtering would be pushed further down
A subset of shards can be obtained and contacted
If the query needs to visit every shard
A single call is sent to every server that manages some shards
of the table
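A toy sketch of shard pruning by a sharding-key predicate (the key ranges are invented): only the shards whose key ranges contain the requested keys need to be contacted:

# Sketch of shard pruning: a filter on the sharding key restricts which shards
# (key ranges) need to be contacted at all (illustrative only).
shard_ranges = {"shard1": (0, 100), "shard2": (100, 200), "shard3": (200, 300)}

def shards_for_keys(wanted_keys, shard_ranges):
    hit = set()
    for key in wanted_keys:
        for shard, (lo, hi) in shard_ranges.items():
            if lo <= key < hi:
                hit.add(shard)
    return hit

# WHERE ckey IN (12, 150): only two of the three shards are contacted.
print(shards_for_keys([12, 150], shard_ranges))   # {'shard1', 'shard2'}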
Query Distribution API
The single-consumer API is used when a single client
process consumes the results of a query.
Typical API supported by most storage systems
The parallel-consumer API is used for consuming query
results in parallel by multiple processes. E.g. in map-reduce
style processing
Special API for very large data sets
Single-Consumer API
Spanner does not have a designated coordinator for handling
distributed queries
In MongoDB, mongos is the dedicated coordinating service
Theoretically any server hosting the data can function as a
root server to coordinate the execution
Using the server that owns all or some of the data would
reduce unnecessary network traffic
Root server may need to merge partial results or do further
processing
Spanner uses a mechanism called a location hint to ensure
that this happens most of the time
Parallel-consumer API
Use case:
Query results need to be further processed
Query results are too big to be processed in a single machine
Solution
Sending partial results directly to the processing machines
Restriction
Only queries that are root partitionable can use parallel-consumer API
Distributed Union is at the root of the operator tree
E.g. the final result is just a simple concatenation of the subqueries’ results
Subquery results are sent directly to the processing nodes
Process
The API takes the desired degree of parallelism and works out a set
of opaque query partition descriptors
The query is executed on individual partitions, initiated from the processing
nodes, i.e. the processing nodes become the clients
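A rough sketch of this flow; partition_query and execute_partition are hypothetical names standing in for the real client API:

# Sketch of the parallel-consumer flow.  partition_query / execute_partition
# are made-up stand-ins for the actual client API.
def partition_query(query, shard_ranges, desired_parallelism):
    # Return opaque descriptors; here, simply one key range per partition.
    ranges = list(shard_ranges)[:desired_parallelism]
    return [{"query": query, "range": r} for r in ranges]

def execute_partition(descriptor, data_by_range):
    # Runs on a processing node (e.g. a map task), which acts as the client.
    return [row for row in data_by_range[descriptor["range"]]]

data_by_range = {(0, 100): ["row-a", "row-b"], (100, 200): ["row-c"]}
descriptors = partition_query("SELECT ...", data_by_range.keys(), 2)
results = [execute_partition(d, data_by_range) for d in descriptors]  # in parallel
print(results)    # [['row-a', 'row-b'], ['row-c']]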
Key takeaway points
A possible solution to distributed querying involving multiple tables
MongoDB shows an example of running distributed query involving single collection
Spanner uses co-located parent-child tables (pre-join solution)
General rules of building query plan for complex distributed queries
Special support for big data processing framework
Difference between logical and physical data models
It is possible to have a semi-relational logical model built on physical layers totally different from a
classic RDBMS
Indexing on non-primary-key columns is not mentioned, and is very likely not supported
Fault tolerance is a key issue but is not explained in enough detail
E.g. there is a whole section on Query Restart, but few technical details are
given
Transaction support is the main topic in OSDI’12 paper, check the presentation
video for more details
https://www.usenix.org/conference/osdi12/technical-sessions/presentation/corbett
References
Corbett, James C., et al. “Spanner: Google’s Globally-Distributed Database.” Proceedings of OSDI, 2012.
Bacon, David F., et al. “Spanner: Becoming a SQL System.” Proceedings of the 2017 ACM International Conference on Management of Data. ACM, 2017.