COMP5338 – Advanced Data Models
Dr. Ying Zhou
School of Information Technologies
COMP5338 – Advanced Data Models
Week 1: Big Data, NoSQL and the Polyglot Persistence
Outline
The Value and limitations of SQL
Typical Scale-Out Options
NoSQL Storage Systems
Polyglot Persistence and Data Lake
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-2
The Relational Model of Data
Entity-Relationship (ER) Data Model describes data as
Entities – distinct objects in the domain
Relationships – between two or more entities
Entities – described using a set of attributes
Relationships – have attributes too
Used for Conceptual Database Design
Translated to final database implementation
Employee DepartmentWorks_In
TFN
name
desk
since dname
budgetdid
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-3
The Rational RDBMS
Commercial vendors: Oracle, IBM, Microsoft, …
Open source systems: MySQL, PostgreSQL, …
Common features
Disk-oriented storage;
Table stored row-by-row on disk
B-trees as indexing mechanism
Dynamic locking as the concurrency-control mechanism
A write-ahead log, or WAL for crash recovery
SQL as the access language
A “row-oriented” query optimizer and executor
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-4
http://cacm.acm.org/magazines/2011/6/108651-10-rules-for-scalable-performance-in-simple-operation-datastores/fulltext
The Value of Relational Databases
Store persistent data
Storing large amounts of data on disks, while allowing applications to grab
the bits they need through queries
Application Integration
Many applications in an enterprise need to share information, which might
happen at the database level
Concurrency Control
Database provide transactions to ensure consistent interaction when many
users access the same information at the same time
Mostly Standard
Relational model is widely used and understood.
SQL is the standard language.
Reporting
SQL’s simple data model and standardization has made it a foundation for
many reporting tools
http://martinfowler.com/articles/nosql-intro.pdf
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-5
http://martinfowler.com/articles/nosql-intro.pdf
The Scaling Problem of SQL
Relational databases are designed
to run on a single machine, so to
scale, you need buy a bigger
machine or increase capacity of
existing server (scale up)
But it’s cheaper and more effective
to scale out by buying lots of
machines.
http://martinfowler.com/articles/nosql-intro.pdf
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-6
http://martinfowler.com/articles/nosql-intro.pdf
The Fixed Schema Problem of SQL
In a relational database
Table structure are predefined
Tables are related with relationships, which are predefined as well
Schema evolution in RDBMS has large impact on queries
and applications
Example
MediaWiki had been through 171 schema versions between April
2003 and November 2007.
MySQL backend
~ 34 tables, ~242 columns, ~700GB in wikipedia (note: 2008 data)
Schema change has big impact on queries
Large number of queries could fail due to schema change.
http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-7
http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf
World of Big Data
Big Data are high-volume, high-velocity, and/or high-variety
information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process optimization.
[Gartner 2012 report]
Leaders in database research community identified “big data as a
defining challenging of our time” in a 2013 meeting.
Three major trends behind big data
“It has become much cheaper to generate a wide variety of data, due to
inexpensive storage, sensors, smart devices, social software, multiplayer
games and the Internet of Things, …”
“It has become much cheaper to process large amount data, due to
advances in multicore CPUs, solid state storage, inexpensive cloud
computing, and open source software”
“data management has become democratized. The process of
generating, processing, and consuming data is no longer just for database
professionals. Decision makers, domain scientists, .. and everyday
consumers now routinely do it”
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-8
D. Abadi, et al. “The Beckman report on database research”. Commun.
ACM 59, 2 (January 2016), 92-99
Schema Change is Unavoidable
News paper site example
Early days, for each news article, we may only record the following
information
Title, author, publishing date and time, actual content
Gradually we may want to record more about an article
Keywords, views, geotags, comments, who “favoured” it, who emailed it,
who twittered it … etc
Evolution of an application is inevitable
Accept it, incorporate it in the long-term plan for the system
Pick a system that allows schema evolution or have a strategy
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-9
Outline
The Value and limitations of SQL
Typical Scale Out Options
NoSQL Storage Systems
Polyglot Persistence and Data Lake
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-10
When Scalability Becomes an Issue
“Scalability is the capability of a system, network, or
process to handle a growing amount of work, or its potential
to be enlarged in order to accommodate that growth.”
[wikipedia: https://en.wikipedia.org/wiki/Scalability ]
In database context, the need for scaling occurs when the
size of the database and/or the traffic against it grows to
the point of crossing an optimal level of performance
Scale up
Scale out
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-11
https://en.wikipedia.org/wiki/Scalability
Scalability Scenario I
Persistent Storage Requirements
Medium data size (can fit in one server)
Typical query workload consists of large number of read request
and relatively low number of write request
Example: wikipedia
Only article meta data such as article revision history, articles
relations, user account and setting are stored in core relational
database system (MySQL)
Article text and images are stored separately
Key challenge
Scale to maintain reasonable read latency
http://www.nedworks.org/~mark/presentations/san/Wikimedia%20architecture.pdf
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-12
http://www.nedworks.org/%7Emark/presentations/san/Wikimedia%20architecture.pdf
Scalability Scenario I — Solution
Master-Slave Replication
Adopted by many companies
Also a typical approach to ensure durability
Example: Wikipedia has one Master database and many
replicas
http://www.nedworks.org/~mark/presentations/san/Wikimedia%20architecture.pdf
Master MySQL
Replica MySQL
Replica MySQL
Replica MySQL
Replica MySQL
replica setAll writes go to the master
They are pushed to all replicas
Reads are load balanced to one of the
replicas
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-13
http://www.nedworks.org/%7Emark/presentations/san/Wikimedia%20architecture.pdf
Scalability Scenario I — Implications
When the master dies
One of the replica can be elected as the new master
Some read may return old data if the latest value has not
been pushed from the master
It is possible to let Master handle read request for data requiring
strong consistency
Relatively easy to setup in most RDBMS
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-14
Scalability Scenario II
Persistent Storage Requirements
Medium or large data size (cannot fit in one server)
Typical query workload consists of large number of read request
and large number of write request
Example: flickr.com
Heavy write traffic: upload photos, adding tags, adding favourite, …
Over 400,000 photos being added every day. (note: 2007 data)
More than 4 billion queries per day. (note: 2007 data)
Uses MySQL as backend storage
Key challenge
Scale to maintain both read and write latency
http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html
http://highscalability.com/flickr-architecture
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-15
http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html
http://highscalability.com/flickr-architecture
Scalability Scenario II — Solution
Database Sharding
The process of slicing a database across multiple machines
Most likely horizontally (e.g., store all data related with a particular
user in one shard)
Shard-aware coordinator
Shard 1 Shard 2 Shard 3
user d,e,fuser a,b,c user g,h,i
Shard 1 replica
Shard 1 replica
Shard 1 replica
Shard 2 replica
Shard 2 replica
Shard 2 replica
Shard 3 replica
Shard 3 replica
Shard 3 replica
Find user a’s photo stream
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-16
Kept in MEMcache:
user_id -> shard id
photo_id-> owner_id
….
Scalability Scenario II – Flickr Example
User Bob adds User Alice’s photo 123 as “favourite”
Pulls the photo (123) owner’s account from cache (“Alice”), to get the
shard location
SHARD-5
Pulls Bob’s information from cache, to get Bob’s shard location
SHARD-13
Start a “distributed transaction”
http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html
User Bob adds
your photo 123
as favourite
User Alice’s
photo 123 is
faved by user
Bob
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-17
http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html
Scalability Scenario II — Implications
Data have to be de-normalized
E.g. in the previous example, the “fav” relation is stored in both Bob
and Alice’s record.
Join is too expensive when data are sharded
Sometimes join can’t be avoided, e.g. building friends network
Re-balancing or Re-Sharding is hard
What to do when data do not fit in one shard
Deciding on a partition factor/plan is hard
May generate hotspots
See the twitter example on next slides
Sharding is largely managed outside RDBMS
Recent version of RDBMS may provide limited support for sharding
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-18
Scalability Scenario II – Twitter Example
Twitter’s problem
To store 250 million tweets a day using MySQL
Twitter’s original Tweet Store:
Sharding based on time
Range partition (timestamp range)
The benefits: shards are filled one by one sequentially
The downsides: very unbalanced traffic
Shards with old tweets do not get any traffic
Twitter’s new Tweet Store:
Sharding based on random partition (id based)
A set of in-house systems to manage shards on top of MySQL
http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-19
http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html
Outline
The Value and limitations of SQL
Handling Scalability
NoSQL Storage Systems
Polyglot Persistence and Data Lake
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-20
The Coming of NoSQL Storage Systems
There is no standard definition of NoSQL, the term came up
during a workshop on 2009 with presentations from
Voldemort, Cassandra, Dynomite, HBase,
Hypertable, CouchDB and MongoDB
Means “Not Only SQL”
Typical features
They don’t use the relational data model and thus don’t use the SQL
language
They don’t have fixed schema, allowing you to store any data in any
record
Many of them are designed to run on a cluster
Manage “sharding”, fault-tolerance, etc. efficiently
Many of them can be integrated with big data processing framework
such as MapReduce
http://martinfowler.com/articles/nosql-intro.pdf
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-21
http://martinfowler.com/articles/nosql-intro.pdf
NoSQL Ecosystem — Scalability
Distributed NoSQL systems
Designed to run on a cluster
Support automatically partitioning data across multiple machines
Machines can add or leave a running cluster
Handles failover, fault-tolerance
Example Distributed NoSQL systems
HBase, Cassandra,…
Non-distributed NoSQL systems
Designed to run on a single machine
Some has limited support for replication and sharding
Schema-less and “object” support
Example
MongoDB, Neo4j, etc..
http://www.rackspace.com/blog/nosql-ecosystem/
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-22
http://www.rackspace.com/blog/nosql-ecosystem/
NoSQL Ecosystem – Data Models
Document store
Has “table” like concept
Each “record” in a “table” is a semi-structured document
Examples: MongoDB, CouchDB
Column based store
Inspired by Google’s Bigtable structure
Has “table” like concept
Storage is organized around column family instead of “row”
Examples: Hbase, Cassandra
Key Value Store
Inspired by Amazon’s Dynamo storage
The overall storage is structured like a big hash table
May or may not have a “table” concept
Redis, Memcached, Voldemort, S3, Cassandra, DynamnoDB
Graph model
Storage is organized as nodes and edges
Neo4j
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-23
Outline
The Value and limitations of SQL
Handling Scalability
NoSQL Storage Systems
Polyglot Persistence and Data Lake
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-24
Polyglot Persistence
SQL is still useful in many ways
The relational model fits many real world data
ACID transactions are desirable in certain cases
Most NoSQL systems have no or very limited transactional support
Supporting tools
Standard query API
Polyglot persistence is quite common in large enterprise
using multiple data storage technologies, chosen based upon the
way data is being used by individual applications.
http://martinfowler.com/articles/nosql-intro.pdf
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-25
http://martinfowler.com/articles/nosql-intro.pdf
Data Lake
Data Lake is a term coined in 2015 referring to a collection
of datasets in various formats
Most organizations manage large number of datasets generated or
collected from different sources
Structured files, databases, spreadsheets, etc
It also refers to the analytic tools/services build on top of it.
There are tools for managing data lake in house or in the
cloud
Google GOODS, Azure Data Lake Storage, etc
This is still in early stage, with no standard requirement
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-26
Google’s GOODS
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-27
Halevy, Alon Y., et al. “Managing Google’s data lake: an overview of the Goods system.” IEEE
Data Eng. Bull. 39.3 (2016): 5-14.
Hype Cycle for Data Management, 2017
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-28
https://www.gartner.com/newsroom/id/3809163
In Conclusion
The rise of NoSQL databases marks the end of the era
of relational database dominance
But NoSQL databases will not become the new dominators.
Relational will still be popular, and used in the majority of
situations. They, however, will no longer be the automatic
choice.
Data lake allows data to be stored in various ways,
relational, nonrelational and also raw format as files
http://martinfowler.com/nosql.html
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-29
References
Paramod J. Sadalage and Martin Fowler, “NoSQL Distilled”, Addison-Wesley Professional; 1 edition
(August 18, 2012)
Chapter 1
Daniel Abadi, Rakesh Agrawal, Anastasia Ailamaki, Magdalena Balazinska, Philip A. Bernstein, Michael
J. Carey, Surajit Chaudhuri, Surajit Chaudhuri, Jeffrey Dean, AnHai Doan, Michael J. Franklin,
Johannes Gehrke, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V.
Jagadish, Donald Kossmann, Samuel Madden, Sharad Mehrotra, Tova Milo, Jeffrey F. Naughton,
Raghu Ramakrishnan, Volker Markl, Christopher Olston, Beng Chin Ooi, Christopher Ré, Dan Suciu,
Michael Stonebraker, and Todd Walter, Jennifer Widom. 2016. The Beckman report on database
research. Commun. ACM 59, 2 (January 2016), 92-99. DOI=http://dx.doi.org/10.1145/2845915
Curino, Carlo A., Hyun J. Moon, Letizia Tanca, and Carlo Zaniolo. “Schema Evolution In Wikipedia.”
ICEIS, 2008.
http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf
High Scalability post: “How Twitter Stores 250 Million Tweets A Day Using MySQL”, 2011
http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html
Dathan Pattishall, “Federation at Flickr: Doing Billions of Queries per Day”, 2007
http://www.scribd.com/doc/2592098/DVPmysqlucFederation-at-Flickr-Doing-Billions-of-Queries-Per-Day
Halevy, Alon Y., et al. “Managing Google’s data lake: an overview of the Goods
system.” IEEE Data Eng. Bull. 39.3 (2016): 5-14.
Gartner Reveals the 2017 Hype Cycle for Data Management
https://www.gartner.com/newsroom/id/3809163
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-30
http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf
http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html
COMP5338 – Advanced Data Models
Outline
The Relational Model of Data
The Rational RDBMS
The Value of Relational Databases
The Scaling Problem of SQL
The Fixed Schema Problem of SQL
World of Big Data
Schema Change is Unavoidable
Outline
When Scalability Becomes an Issue
Scalability Scenario I
Scalability Scenario I — Solution
Scalability Scenario I — Implications
Scalability Scenario II
Scalability Scenario II — Solution
Scalability Scenario II – Flickr Example
Scalability Scenario II — Implications
Scalability Scenario II – Twitter Example
Outline
The Coming of NoSQL Storage Systems
NoSQL Ecosystem — Scalability
NoSQL Ecosystem – Data Models
Outline
Polyglot Persistence
Data Lake
Google’s GOODS
Hype Cycle for Data Management, 2017
In Conclusion
References