COMP5338 – Advanced Data Models
Dr. Ying Zhou
School of Information Technologies
COMP5338 – Advanced Data Models
Week 5: Column Store and Google Bigtable
Administrative
There will be a quiz on Week 6
Covers week 1- week 5 content
Paper based
It is running on Tuesday evening 8-9pm
All Tuesday classes please go to your allocated tutorial rooms
All Wednesday classes please stay in lecture theatre for the quiz
There is no regular tutorial on week 6
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-2
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
Outline
Overview
Row Store vs. Column Store
Bigtable motivation
Bigtable Data model
Bigtable Architecture
05-3
Organization of Disk Based Storage
System
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-4
MongoDB file structure
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-5
Column Store From RDBMS Perspective
Row store is easy to add/modify a record but might read in
unnecessary data if a row contains many columns
Good for OLTP type of application
Column store is good for read and analysis relevant data but
requires multiple accesses to update a row
Good for OLAP (data warehouse type of application)
The only fundamental difference is storage layout!
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
From Stavros Harizopoulos, Daniel Abadi, Peter Boncz VLDB 2009 Tutorial
05-6
Row Store vs. Column Store
Row store or NSM (N-ary Storage Model) is used in most database
management systems
Many relational database systems
Considered in general as write optimized
MongoDB is a “row store”
All data in a document is placed contiguously in storage
Schema less feature makes storage design more challenging as document may
grow or shrink in an unpredictably way
Compression is less efficient as row contains various data
Column store or DSM (Decomposed Storage Model)
The idea is proposed in 1985, the real practical modern implementation is C-
Store from MIT by Stonebraker et. al in 2005
Google’s BigTable is influence by DSM principle
With distinct key-value features
So does HBase
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-7
Bigtable Motivation
Some of Google’s daily business
Query
A whole copy of the web
Links between pages
Personalized Search
User’s query history, click streams
Google Analytics
Traffic data (who visits what at what time, for how long)
Google Earth
Satellite images, geo information
And so on..
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-8
How are Data Accessed
Web pages
Scanned to build inverted index (word -> page)
Unstructured, sequential read
Page meta data, links between pages
Used to rank pages, to compute PageRank algorithm
Structured, random access, mainly point queries
Query history, click streams
Used to build profile and run personalization algorithm
Structured, random access, point or range query
Traffic data
Used to build summary statistics
Structured, random access, point or range query
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-9
Google Storage Systems
Typical Data/Access Features
Massive scale data set of structured or unstructured data
Sequential or simple random access, majority of the data updates are
“append”
Storage systems to cater for such data storage/access
Google File System (SOSP’03 paper)
Unstructured data, sequential access
BigTable (OSDI’06 paper)
Structured data, random access
More recent storage system to cater for developers’ desire to use SQL
MegaStore (CIDR’11 paper)
Build on top of Bigtable, an effort to combine the scalability of NoSQL and the
convenience of a RDBMS
Spanner (OSDI’12 paper)
A successor to BigTable with more relational features and better performance
than MegaStore
There is a recent SIGMOD’17 paper focusing on how SQL is implemented
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-10
Outline
Overview
Bigtable Data model
Bigtable Architecture
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-11
Data Model
“A Bigtable is a sparse, distributed, persistent
multidimensional sorted map”
Basic concepts: table, row, column family, column,
timestamp
(rowkey:string, columnKey:string, timestampe:int64) -> value: string
Example bigtable to store web pages
Stores the data about home page of cnn website
The URL is “www.cnn.com”
The language is “EN”
The content is “ …”
It is referenced by two other pages
• Sports Illustrated (cnnsi.com) , using an anchor text “CNN”
• My-Look (my.look.ca), using an anchor text “CNN.com”
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-12
Relational Data Model vs Bigtable Model
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
url language content
“www.cnn.com” “EN” “ … ”
web table
link table
url referencingUrl anchorText
“www.cnn.com” “connsi.com” “CNN”
“www.cnn.com” “my.look.ca” “CNN.com”
language content anchor
“connsi.com” “my.look.ca”
“EN” “ …” “CNN” “CNN.com”“com.cnn.www”
Row key
column
family
column family
with two
column keys
column
family
05-13
Rows
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
language content anchor
“com.cnn.www”
“com.cnn.www/WORLD”
“com.cnn.weather”
“com.cts.www”
sorted
Row keys are arbitrary strings
Read/write of data under a single row key is atomic
Row keys are sorted in lexicographic order
Large table is dynamically partitioned by row key ranges
Each partition is called a tablet
Nearby rows will usually be served by the same server
Accessing nearby rows requires communication with a small number of
machines
05-14
Table Splitting
A table starts as one tablet
As it grows it splits into multiple tablets
Approximate size: 100-200 MB per tablet by default
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
language content anchor
“com.cnn.www”
“com.cnn.www/WORLD”
“com.cnn.weather”
“com.cts.www”
One tablet
05-15
Table Splitting (cont’d)
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
language content anchor
“com.cnn.www”
“com.cnn.www/WORLD”
“com.cnn.weather”
“com.cts.www”
language content anchor
“com.nytimes.www”
“com.seattletimes.www”
“com.washingtonpost.www”
“com.zdnet.www”
sorted
last key
last key
05-16
Columns and Column Families
Relational model only has “row” and “column” concepts
Bigtable has “row”, “column” and “column family” concepts
Column family
Just a group of columns with a printable name
Each column inside a column family has a column key
Column key is named as family:qualifier
Column family can be viewed is a convenient way to store
“collection” type data at design level
It also determines how table’s data are stored
Column family is the basic unit of data access
Data stored in a column family is usually of the same type
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-17
Columns and Column Families (cont’d)
Column Family is part of the schema definition
When we create a table, we also create a few column families by
specifying their names
The number of column families in a table is typically small and
relatively stable
Less than hundred
A column family theoretically can have unlimited number of columns
The row could be very “wide”
E.g. a popular web page in the web table may be referenced by
thousands, or even millions of other pages
Implications: we may have some tablet storing only one row!
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-18
Column Family Examples
The web table example has three column families
“language” — with only one column to store a web page’s language
Each web page can only have one language
Just like a normal column in relational table
Column key is “language:”
“content” — again with only one column to store the actual HTML text
Column key is “content:”
“anchor” — with dynamic number of columns
Each web page may be referenced by different number of other pages
E.g. www.cnn.com page has two referencing sites
Column key is “anchor:
• Question: Why can’t we use “anchor:
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-19
Timestamps
Classic relational model can only store the “current” value of
a particular row and its columns
Temporal DB may be able to store valid/transaction time
Bigtable stores multiple versions of a column by design
Version is indexed by a 64-bit timestamp
System time or assigned by client
If system time is used, this is equivalent to transaction time
Client assigned time can have various meanings
Per-column-family settings for garbage collection
Keep only latest n versions
Or keep only versions written since time t
Retrieve most recent version if no version specified
If specified, return version where timestamp ≤ requested time
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-20
Web Table with Timestamp
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
language content anchor
“com.cnn.www” “EN” t1 t1“…”
t2“…”
t6“…”
t9“CNN” t7“CNN.com”
The sorted map concept
(rowkey:string, columnKey:string, timestampe:int64) -> value: string
Examples:
(“com.cnn.www”, “language:”, t1) -> “EN”
(“com.cnn.www”, “anchor:consi.com”, t9) -> “CNN”
“my.look.ca”“connsi.com”
05-21
Typical APIs
Data definition API
Create/delete table and column families
Update table/column family metadata
Data Manipulation API
Write or delete value as specified by rowkey and some column
qualifier
Look up specific row by row key
Scan a short range of rows
Support single row transaction
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-22
Outline
Overview
Bigtable Data model
Bigtable Architecture
Immutable SSTable
Master-Tablet Server Architecture
Chubby Services
Read/Write Path
HBase
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-23
Data Storage
Google File System (GFS)
Is used to store actual Bigtable data (log and data files)
It provides replication/fault tolerance and other useful
features in a cluster environment
Google SSTable file format
Bigtable data are stored internally as SSTable format
Each SSTable consists of
Blocks (default 64KB size ) to store ordered immutable
map of key value pairs
Block index
The SSTable is stored as GFS files and are
replicated
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-24
Table-Tablet-SSTable
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
block
01
block
02
block
03
block
05
index
block
04
tablet 1 (minimal key, “pp”)
block
06
Block
07
index
block
08
block
09
tablet 2 (“pq”, maximum key)
Table
“aa”
“hi”
index
“a1”
“gt”
“kk”
“pq”
“sa”
“st”
“tt”
SSTable
SSTable
SSTable
Find row (“ba”): block 01 and/or block 03
Find row (“it”): block 02 or block 04
Find row (“ty”): block 09
05-25
Architecture
Many tablet servers
Can be added or removed dynamically
Each manages a set of tablets (typically 10-1,000 tablets/server)
Handles read/write requests to tablets
Splits tablets when too large
One master server
Assigns tablets to tablet server
Balances tablet server load
Garbage collection of unneeded files
Schema changes (table & column family creation)
It is NOT in the read/write path
Client library
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-26
Tablet Location
METADATA table contains the location of all tablets in the cluster
It might be very big and split into many tablets
The location of METADATA tablets is kept in a root tablet
This can never be split
Each tablet is assigned to ONE tablet server at a time.
Both ROOT and METADATA tablets are managed by tablet servers as
well
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-27
Chubby Services
Chubby is distributed lock service consists of a small
number of nodes (~5)
Each is a replica of one another
One is acting as the master
Paxos is used to ensure majority of the nodes have the latest data
Chubby allows clients to create directory/file and locks on
them
Lock has short lease time and needs to be renewed periodically
Usage in Bigtable
Ensure there is only one master
Keep track of all tablet servers
Stores the root table location
If Chubby becomes unavailable for an extended period of time,
Bigtable becomes unavailable.
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-28
Chubby and Tablet Servers
Tablet servers are able to join or leave a running cluster
without interfering the normal cluster operation
Chubby is used to keep track of tablet servers
Normal handling
Each server creates & locks a unique file in Server Directory when it
starts
The lock has short lease and needs to be renewed periodically
If a tablet server is scheduled to leave the cluster, it will release its lock
Error handling
A tablet server may lose the lock (e.g. expires)
• It will stops serving the tablets
• It will report to master that the lock is lost
• It will attempt to reclaim the lock if the file still exists, otherwise it
kills itself
A tablet server may crash and its file become orphaned
• Master will come to the rescue
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-29
Chubby and Master Operation
Master also obtains an exclusive master lock from chubby to
ensure there is only one master server
Master monitors Chubby’s server directory to find the
current list of tablet servers in the cluster
Master detects the status of tablet servers by periodically
ask each server for the status of its lock
Error handling
If tablet server is alive but has no lock or if the tablet server is
unreachable
The master will contact Chubby to acquire a lock on the orphaned server
file and delete it
The master also assigns all tablets to other servers
If a master cannot contact Chubby to renew its lock, it kills itself
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-30
Master Start Up
When a master is started
1. It grabs a unique master lock in Chubby
2. Find out all live servers
3. Communicate with all servers to find out what tablets they serve
4. Scan the METADATA table to find the total set of tablets in the
cluster
May discover tablets that are not assigned
5. Assign tablets without a server to a new tablet server
Any cluster has a root tablet, in step 3, the master may
Find the server that manages the root tablet and proceed with step
4
Find that the root tablet is not assigned to any server, the master
will assign it to a server and proceed with step 4
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-31
Tablet Assignments
Master knows the initial set of tablets during start up
process
Master assign tablets to servers to balance the load
The set may change
When tables are created or deleted
Two tablets are merged to form one
An existing tablet is split into two smaller ones
The master initiate the first two and can update tablet
assignment accordingly
The splitting is initiated by tablet server and the information
of the new tablet will be updated in the METADATA table
The tablet server also notifies the master of such change
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-32
Tablet Serving
Client read/write request
E.g. client wants to read the row corresponding to “com.cnn.www”
from the web table
Steps
Find the tablet location, the table server that serves the tablet
Contact the tablet server to perform the read/writhe request
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
master
Tablet server Tablet server Tablet server
chubbyclient
Read row
“com.cnn.www” from
web table
Bigtable
05-33
Find the tablet server
If the client is requesting the data for first time
One round trip from chubby to find the root tablet’s location
One round trip to the tablet server manages the root tablet
One round trip to the tablet server manages the METADATA tablet
The client caches the tablet location for later use
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
master
Tablet server Tablet server Tablet server
chubbyclient
Read row
“com.cnn.www” from
web table
Bigtable
1 Who manages the root tablet?
2 Who manages the MTADATA
tablet stores data about web
table “com.cnn.www” row ?
3
Who manages row
“com.cnn.www”
of web table?
05-34
Tablet Representation
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
Asynchronous
compaction
Sorted buffer stores recent updates
SSTable files of one tablet, which persists earlier updates Write operation
05-35
Tablet Representation Implications
A tablet server manages many tablets
Its memory contains latest updates of those tablets
BUT, the actual persisted data of those tablets might not be stored in
this tablet server
Logs and SSTable Files are managed by the underlying file system GFS
GFS might replicate the files in any server
Bigtable system is not responsible for actual file replication
and placement
The separation of concern simplifies the design
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-36
Write Path
A write operation may insert new data, update or delete
existing data
The client sends write operation directly to the tablet server
The operation is checked for syntax and authorization
The operation is written to the commit log
The actual mutation content is inserted in the memtable
Deleted data will have a special entry/marker
The only disk operation involved in write path is to append
update to commit log
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-37
Compactions
After many write operations, the size of memtable increases
When memtable size reaches a threshold
The current one is frozen and converted to an SSTable and written
to GFS
A new memtable is created to accept new updates
This is called minor compaction
Why minor compaction
Memory management of tablet server
Reduce the size of active log entries
Minor compaction persists the updates on disk
Log entries reflecting those updates are no longer required
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-38
Compactions (cont’d)
Every minor compaction creates a new SSTable
A tablet may contain many SSTable with overlapping key ranges
Merging compaction happens periodically to merge a few
SSTables and the current memtable content into a new
SSTable
Major compaction write all SSTable contents into a single
SSTable. It will permanently remove the deleted data.
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-39
Read Path
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou)
Asynchronous
compaction
Sorted buffer stores recent updates
05-40
Read Path
The client sends read operation directly to the tablet server
The operation is check for syntax and authorization
Both memory and disk maybe involved to obtain the data
What are kept in memory
Most recent updates in memtable (sorted by key)
Block indexes of SSTable files
What are kept in disk
Earlier updates persisted in one or many SSTable files
How does tablet server find the data
Check if the memtable contains partial data, or special mark indicating
certain data is deleted
Check the index to find the block(s) that may contain partial data
Load the block and extract the data if there is any
Combine the data from memtable and disk block to obtain the final result
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-41
Recover a Tablet
Tablet may be re-assigned to a new tablet server as part of
load balancing or recovery process
The assignment is initiated by master sending a load tablet
request to a tablet server.
Upon receiving such request, a tablet server performs the
following:
Scan the METADATA table to find information about this tablet
List of SSTables
Log file
Read the block indexes in memory
Play the log file to reconstruct the memory with all updates are not
yet persisted in SSTables
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-42
Refinements- Locality Group
Locality group consists of multiple column families specified
by client
There will be a separate SSTable for each locality group in
each tablet.
“column based” storage
Reasons
Bigtable support wide rows
Not all column families are required in most operation
Put column families that are typically access together in the same
group enables more efficient read
E.g. web page’s metadata and actual content can be put in different
groups
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-43
Sample Application – Google Analytics
Raw Click Table (~200 TB)
Row for each end-user session
Row name: {website name and time of session}
Sessions that visit the same web site are sorted & contiguous
Summary Table (~20 TB)
Contains various summaries for each website
Generated from the Raw Click table via periodic MapReduce jobs
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-44
What is HBase?
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-45
HBase is a column based NoSQL storage system based on
Google’s Bigtable data model and architecture
It is fully distributed
It is not a general purpose storage system
Google File System
Google Bigtable
Chubby lock service
Apache Zookeeper
HBase and Bigtable Nomeclature
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-46
Bigtable HBase
Tablet Region
Tablet Server Region Server
ROOT and METADATA tablet (two levels) hbase:meta table (one level)
SSTable HFile
memtable MemStore
Commit log Write-Ahead Log
Minor compaction Flush
Merging compaction Minor compaction
Major compaction Major compaction
GFS HDFS
Chubby Zookeeper
Locality Group By default, each column family is a locality group
References
Google Storage Stake Reading List:
Sanjay Ghemawat, Howard Gobioff and Shun-Tak Leung, The Google File System, In
Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP’03), 2003
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike
Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage
System for Structured Data, OSDI’06: In Proceedings of the Seventh Symposium on Operating
System Design and Implementation (OSDI’06), Seattle, WA, 2006
Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson, Jean-
Michel Leon, Yawei Li, Alexander Lloyd, Vadim Yushprakh et al. Megastore: Providing Scalable,
Highly Available Storage for Interactive Services. Proc. of CIDR. 2011, pp. 223–234.
Corbett, James C; Dean, Jeffrey; Epstein, Michael; Fikes, Andrew; Frost, Christopher; Furman, JJ;
Ghemawat, Sanjay; Gubarev, Andrey; Heiser, Christopher; Hochschild, Peter; Hsieh, Wilson;
Kanthak, Sebastian; Kogan, Eugene; Li, Hongyi; Lloyd, Alexander; Melnik, Sergey; Mwaura,
David; Nagle, David; Quinlan, Sean; Rao, Rajesh; Rolig, Lindsay; Saito, Yasushi; Szymaniak,
Michal; Taylor, Christopher; Wang, Ruth; Woodford, Dale, Spanner: Google’s Globally-
Distributed Database Proceedings of OSDI, 2012
Bacon, David F., et al. “Spanner: Becoming a SQL System.” Proceedings of
the 2017 ACM International Conference on Management of Data. ACM,
2017.
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 05-47
COMP5338 – Advanced Data Models
Administrative
Outline
Organization of Disk Based Storage System
MongoDB file structure
Column Store From RDBMS Perspective
Row Store vs. Column Store
Bigtable Motivation
How are Data Accessed
Google Storage Systems
Outline
Data Model
Relational Data Model vs Bigtable Model
Rows
Table Splitting
Table Splitting (cont’d)
Columns and Column Families
Columns and Column Families (cont’d)
Column Family Examples
Timestamps
Web Table with Timestamp
Typical APIs
Outline
Data Storage
Table-Tablet-SSTable
Architecture
Tablet Location
Chubby Services
Chubby and Tablet Servers
Chubby and Master Operation
Master Start Up
Tablet Assignments
Tablet Serving
Find the tablet server
Tablet Representation
Tablet Representation Implications
Write Path
Compactions
Compactions (cont’d)
Read Path
Read Path
Recover a Tablet
Refinements- Locality Group
Sample Application – Google Analytics
What is HBase?
HBase and Bigtable Nomeclature
References