程序代写代做代考 Agda concurrency database hbase cache SQL COMP5338

COMP5338 – Advanced Data Models

Dr. Ying Zhou
School of Information Technologies

COMP5338 – Advanced Data Models
Week 1: Big Data, NoSQL and the Polyglot Persistence

Outline
 The Value and limitations of SQL

 Typical Scale-Out Options

 NoSQL Storage Systems

 Polyglot Persistence and Data Lake

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-2

The Relational Model of Data
 Entity-Relationship (ER) Data Model describes data as

 Entities – distinct objects in the domain
 Relationships – between two or more entities

 Entities – described using a set of attributes
 Relationships – have attributes too
 Used for Conceptual Database Design

 Translated to final database implementation

Employee DepartmentWorks_In

TFN

name

desk

since dname

budgetdid

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-3

The Rational RDBMS
 Commercial vendors: Oracle, IBM, Microsoft, …
 Open source systems: MySQL, PostgreSQL, …
 Common features

 Disk-oriented storage;
 Table stored row-by-row on disk
 B-trees as indexing mechanism
 Dynamic locking as the concurrency-control mechanism
 A write-ahead log, or WAL for crash recovery
 SQL as the access language
 A “row-oriented” query optimizer and executor

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-4

http://cacm.acm.org/magazines/2011/6/108651-10-rules-for-scalable-performance-in-simple-operation-datastores/fulltext

The Value of Relational Databases
 Store persistent data

 Storing large amounts of data on disks, while allowing applications to grab
the bits they need through queries

 Application Integration
 Many applications in an enterprise need to share information, which might

happen at the database level
 Concurrency Control

 Database provide transactions to ensure consistent interaction when many
users access the same information at the same time

 Mostly Standard
 Relational model is widely used and understood.
 SQL is the standard language.

 Reporting
 SQL’s simple data model and standardization has made it a foundation for

many reporting tools

http://martinfowler.com/articles/nosql-intro.pdf
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-5

http://martinfowler.com/articles/nosql-intro.pdf

The Scaling Problem of SQL
Relational databases are designed
to run on a single machine, so to
scale, you need buy a bigger
machine or increase capacity of
existing server (scale up)

But it’s cheaper and more effective
to scale out by buying lots of
machines.

http://martinfowler.com/articles/nosql-intro.pdf

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-6

http://martinfowler.com/articles/nosql-intro.pdf

The Fixed Schema Problem of SQL
 In a relational database

 Table structure are predefined
 Tables are related with relationships, which are predefined as well

 Schema evolution in RDBMS has large impact on queries
and applications

 Example
MediaWiki had been through 171 schema versions between April

2003 and November 2007.
 MySQL backend
 ~ 34 tables, ~242 columns, ~700GB in wikipedia (note: 2008 data)

 Schema change has big impact on queries
 Large number of queries could fail due to schema change.

http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-7

http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf

World of Big Data
 Big Data are high-volume, high-velocity, and/or high-variety

information assets that require new forms of processing to enable
enhanced decision making, insight discovery and process optimization.

[Gartner 2012 report]

 Leaders in database research community identified “big data as a
defining challenging of our time” in a 2013 meeting.

 Three major trends behind big data
 “It has become much cheaper to generate a wide variety of data, due to

inexpensive storage, sensors, smart devices, social software, multiplayer
games and the Internet of Things, …”

 “It has become much cheaper to process large amount data, due to
advances in multicore CPUs, solid state storage, inexpensive cloud
computing, and open source software”

 “data management has become democratized. The process of
generating, processing, and consuming data is no longer just for database
professionals. Decision makers, domain scientists, .. and everyday
consumers now routinely do it”

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-8

D. Abadi, et al. “The Beckman report on database research”. Commun.
ACM 59, 2 (January 2016), 92-99

Schema Change is Unavoidable
 News paper site example

 Early days, for each news article, we may only record the following
information
 Title, author, publishing date and time, actual content

 Gradually we may want to record more about an article
 Keywords, views, geotags, comments, who “favoured” it, who emailed it,

who twittered it … etc

 Evolution of an application is inevitable
 Accept it, incorporate it in the long-term plan for the system
 Pick a system that allows schema evolution or have a strategy

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-9

Outline
 The Value and limitations of SQL

 Typical Scale Out Options

 NoSQL Storage Systems

 Polyglot Persistence and Data Lake

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-10

When Scalability Becomes an Issue
 “Scalability is the capability of a system, network, or

process to handle a growing amount of work, or its potential
to be enlarged in order to accommodate that growth.”

[wikipedia: https://en.wikipedia.org/wiki/Scalability ]

 In database context, the need for scaling occurs when the
size of the database and/or the traffic against it grows to
the point of crossing an optimal level of performance
 Scale up
 Scale out

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-11

https://en.wikipedia.org/wiki/Scalability

Scalability Scenario I
 Persistent Storage Requirements

Medium data size (can fit in one server)
 Typical query workload consists of large number of read request

and relatively low number of write request
 Example: wikipedia

 Only article meta data such as article revision history, articles
relations, user account and setting are stored in core relational
database system (MySQL)

 Article text and images are stored separately
 Key challenge

 Scale to maintain reasonable read latency

http://www.nedworks.org/~mark/presentations/san/Wikimedia%20architecture.pdf

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-12

http://www.nedworks.org/%7Emark/presentations/san/Wikimedia%20architecture.pdf

Scalability Scenario I — Solution
 Master-Slave Replication

 Adopted by many companies
 Also a typical approach to ensure durability

 Example: Wikipedia has one Master database and many
replicas

http://www.nedworks.org/~mark/presentations/san/Wikimedia%20architecture.pdf

Master MySQL

Replica MySQL

replica setAll writes go to the master

They are pushed to all replicas

Reads are load balanced to one of the
replicas

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-13

http://www.nedworks.org/%7Emark/presentations/san/Wikimedia%20architecture.pdf

Scalability Scenario I — Implications
 When the master dies

 One of the replica can be elected as the new master
 Some read may return old data if the latest value has not

been pushed from the master
 It is possible to let Master handle read request for data requiring

strong consistency
 Relatively easy to setup in most RDBMS

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-14

Scalability Scenario II
 Persistent Storage Requirements

Medium or large data size (cannot fit in one server)
 Typical query workload consists of large number of read request

and large number of write request
 Example: flickr.com

 Heavy write traffic: upload photos, adding tags, adding favourite, …
 Over 400,000 photos being added every day. (note: 2007 data)
More than 4 billion queries per day. (note: 2007 data)
 Uses MySQL as backend storage

 Key challenge
 Scale to maintain both read and write latency

http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html
http://highscalability.com/flickr-architecture

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-15

http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html
http://highscalability.com/flickr-architecture

Scalability Scenario II — Solution
 Database Sharding

 The process of slicing a database across multiple machines
Most likely horizontally (e.g., store all data related with a particular

user in one shard)

Shard-aware coordinator

Shard 1 Shard 2 Shard 3
user d,e,fuser a,b,c user g,h,i

Shard 1 replica

Shard 2 replica

Shard 3 replica

Find user a’s photo stream

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-16

Kept in MEMcache:
user_id -> shard id
photo_id-> owner_id

….

Scalability Scenario II – Flickr Example
 User Bob adds User Alice’s photo 123 as “favourite”

 Pulls the photo (123) owner’s account from cache (“Alice”), to get the
shard location
 SHARD-5

 Pulls Bob’s information from cache, to get Bob’s shard location
 SHARD-13

 Start a “distributed transaction”

http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html

User Bob adds
your photo 123
as favourite

User Alice’s
photo 123 is
faved by user
Bob

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-17

http://mysqldba.blogspot.com.au/2008/04/mysql-uc-2007-presentation-file.html

Scalability Scenario II — Implications
 Data have to be de-normalized

 E.g. in the previous example, the “fav” relation is stored in both Bob
and Alice’s record.

 Join is too expensive when data are sharded
 Sometimes join can’t be avoided, e.g. building friends network

 Re-balancing or Re-Sharding is hard
What to do when data do not fit in one shard

 Deciding on a partition factor/plan is hard
May generate hotspots

 See the twitter example on next slides

 Sharding is largely managed outside RDBMS
 Recent version of RDBMS may provide limited support for sharding

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-18

Scalability Scenario II – Twitter Example

 Twitter’s problem
 To store 250 million tweets a day using MySQL

 Twitter’s original Tweet Store:
 Sharding based on time
 Range partition (timestamp range)
 The benefits: shards are filled one by one sequentially
 The downsides: very unbalanced traffic

 Shards with old tweets do not get any traffic

 Twitter’s new Tweet Store:
 Sharding based on random partition (id based)
 A set of in-house systems to manage shards on top of MySQL

http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-19

http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html

Outline
 The Value and limitations of SQL

 Handling Scalability

 NoSQL Storage Systems

 Polyglot Persistence and Data Lake

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-20

The Coming of NoSQL Storage Systems

 There is no standard definition of NoSQL, the term came up
during a workshop on 2009 with presentations from
Voldemort, Cassandra, Dynomite, HBase,
Hypertable, CouchDB and MongoDB
Means “Not Only SQL”

 Typical features
 They don’t use the relational data model and thus don’t use the SQL

language
 They don’t have fixed schema, allowing you to store any data in any

record
Many of them are designed to run on a cluster

 Manage “sharding”, fault-tolerance, etc. efficiently
Many of them can be integrated with big data processing framework

such as MapReduce
http://martinfowler.com/articles/nosql-intro.pdf

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-21

http://martinfowler.com/articles/nosql-intro.pdf

NoSQL Ecosystem — Scalability
 Distributed NoSQL systems

 Designed to run on a cluster
 Support automatically partitioning data across multiple machines
Machines can add or leave a running cluster
 Handles failover, fault-tolerance

 Example Distributed NoSQL systems
 HBase, Cassandra,…

 Non-distributed NoSQL systems
 Designed to run on a single machine
 Some has limited support for replication and sharding
 Schema-less and “object” support

 Example
MongoDB, Neo4j, etc..

http://www.rackspace.com/blog/nosql-ecosystem/
COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-22

http://www.rackspace.com/blog/nosql-ecosystem/

NoSQL Ecosystem – Data Models
 Document store

 Has “table” like concept
 Each “record” in a “table” is a semi-structured document
 Examples: MongoDB, CouchDB

 Column based store
 Inspired by Google’s Bigtable structure
 Has “table” like concept
 Storage is organized around column family instead of “row”
 Examples: Hbase, Cassandra

 Key Value Store
 Inspired by Amazon’s Dynamo storage
 The overall storage is structured like a big hash table
 May or may not have a “table” concept
 Redis, Memcached, Voldemort, S3, Cassandra, DynamnoDB

 Graph model
 Storage is organized as nodes and edges
 Neo4j

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-23

Outline
 The Value and limitations of SQL

 Handling Scalability

 NoSQL Storage Systems

 Polyglot Persistence and Data Lake

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-24

Polyglot Persistence
 SQL is still useful in many ways

 The relational model fits many real world data
 ACID transactions are desirable in certain cases

 Most NoSQL systems have no or very limited transactional support
 Supporting tools
 Standard query API

 Polyglot persistence is quite common in large enterprise
 using multiple data storage technologies, chosen based upon the

way data is being used by individual applications.

http://martinfowler.com/articles/nosql-intro.pdf

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-25

http://martinfowler.com/articles/nosql-intro.pdf

Data Lake
 Data Lake is a term coined in 2015 referring to a collection

of datasets in various formats
Most organizations manage large number of datasets generated or

collected from different sources
 Structured files, databases, spreadsheets, etc

 It also refers to the analytic tools/services build on top of it.
 There are tools for managing data lake in house or in the

cloud
 Google GOODS, Azure Data Lake Storage, etc

 This is still in early stage, with no standard requirement

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-26

Google’s GOODS

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-27

Halevy, Alon Y., et al. “Managing Google’s data lake: an overview of the Goods system.” IEEE
Data Eng. Bull. 39.3 (2016): 5-14.

Hype Cycle for Data Management, 2017

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-28

https://www.gartner.com/newsroom/id/3809163

In Conclusion
 The rise of NoSQL databases marks the end of the era

of relational database dominance
 But NoSQL databases will not become the new dominators.

Relational will still be popular, and used in the majority of
situations. They, however, will no longer be the automatic
choice.

 Data lake allows data to be stored in various ways,
relational, nonrelational and also raw format as files

http://martinfowler.com/nosql.html

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-29

References
 Paramod J. Sadalage and Martin Fowler, “NoSQL Distilled”, Addison-Wesley Professional; 1 edition

(August 18, 2012)
 Chapter 1

 Daniel Abadi, Rakesh Agrawal, Anastasia Ailamaki, Magdalena Balazinska, Philip A. Bernstein, Michael
J. Carey, Surajit Chaudhuri, Surajit Chaudhuri, Jeffrey Dean, AnHai Doan, Michael J. Franklin,
Johannes Gehrke, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, H. V.
Jagadish, Donald Kossmann, Samuel Madden, Sharad Mehrotra, Tova Milo, Jeffrey F. Naughton,
Raghu Ramakrishnan, Volker Markl, Christopher Olston, Beng Chin Ooi, Christopher Ré, Dan Suciu,
Michael Stonebraker, and Todd Walter, Jennifer Widom. 2016. The Beckman report on database
research. Commun. ACM 59, 2 (January 2016), 92-99. DOI=http://dx.doi.org/10.1145/2845915

 Curino, Carlo A., Hyun J. Moon, Letizia Tanca, and Carlo Zaniolo. “Schema Evolution In Wikipedia.”
ICEIS, 2008.
 http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf

 High Scalability post: “How Twitter Stores 250 Million Tweets A Day Using MySQL”, 2011
 http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html

 Dathan Pattishall, “Federation at Flickr: Doing Billions of Queries per Day”, 2007
 http://www.scribd.com/doc/2592098/DVPmysqlucFederation-at-Flickr-Doing-Billions-of-Queries-Per-Day

 Halevy, Alon Y., et al. “Managing Google’s data lake: an overview of the Goods
system.” IEEE Data Eng. Bull. 39.3 (2016): 5-14.

 Gartner Reveals the 2017 Hype Cycle for Data Management
 https://www.gartner.com/newsroom/id/3809163

COMP5338 “Advanced Data Models” – 2018 (Y. Zhou) 01-30

http://yellowstone.cs.ucla.edu/schema-evolution/documents/curino-schema-evolution.pdf
http://highscalability.com/blog/2011/12/19/how-twitter-stores-250-million-tweets-a-day-using-mysql.html

COMP5338 – Advanced Data Models
Outline
The Relational Model of Data
The Rational RDBMS
The Value of Relational Databases
The Scaling Problem of SQL
The Fixed Schema Problem of SQL
World of Big Data
Schema Change is Unavoidable
Outline
When Scalability Becomes an Issue
Scalability Scenario I
Scalability Scenario I — Solution
Scalability Scenario I — Implications
Scalability Scenario II
Scalability Scenario II — Solution
Scalability Scenario II – Flickr Example
Scalability Scenario II — Implications
Scalability Scenario II – Twitter Example
Outline
The Coming of NoSQL Storage Systems
NoSQL Ecosystem — Scalability
NoSQL Ecosystem – Data Models
Outline
Polyglot Persistence
Data Lake
Google’s GOODS
Hype Cycle for Data Management, 2017
In Conclusion
References

Related Posts