
What’s Really New with NewSQL?
University
A new class of database management systems (DBMSs) called NewSQL tout their ability to scale modern on-line transaction processing (OLTP) workloads in a way that is not possible with legacy systems. The term NewSQL was first used by one of the authors of this article in a 2011 business analysis report discussing the rise of new database systems as challengers to the established vendors (Oracle, IBM, Microsoft). The other author was working on what became one of the first examples of a NewSQL DBMS. Since then, several companies and research projects have used this term (rightly and wrongly) to describe their systems.
Given that relational DBMSs have been around for over four decades, it is justifiable to ask whether the claim of NewSQL's superiority is actually true or whether it is simply marketing. If these systems are indeed able to get better performance, then the next question is whether there is anything scientifically new about them that enables these gains, or whether it is simply that hardware has advanced so much that the bottlenecks of earlier years are no longer a problem.


To answer these questions, we first discuss the history of databases to understand how NewSQL systems came about. We then provide a detailed explanation of what the term NewSQL means and the different categories of systems that fall under this definition.
1. A BRIEF HISTORY OF DBMSS
The first DBMSs came on-line in the mid 1960s. One of the first was IBM's IMS that was built to keep track of the supplies and parts inventory for the Saturn V and Apollo space exploration projects. It helped introduce the idea that an application's code should be separate from the data that it operates on. This allows developers to write applications that only focus on the access and manipulation of data, and not the complications and overhead associated with how to actually perform these operations. IMS was later followed by the pioneering work in the early 1970s on the first relational DBMSs, IBM's System R and the University of California's INGRES. INGRES was soon adopted at other universities for their information systems and was subsequently commercialized in the late 1970s. Around the same time, Oracle released the first version of their DBMS that was similar to System R's design. Other companies were founded in the early 1980s that sought to repeat the success of the first commercial DBMSs, including Sybase and Informix. Although IBM never made System R available to the public, it later released a new relational DBMS (DB2) in 1983 that used parts of the System R code base.

The late 1980s and early 1990s brought about a new class of DBMSs that were designed to overcome the much touted impedance mismatch between the relational model and object-oriented programming languages [65]. These object-oriented DBMSs, however, never saw widespread market adoption because they lacked a standard interface like SQL. But many of the ideas from them were eventually incorporated in relational DBMSs when the major vendors added object and XML support a decade later, and then again in document-oriented NoSQL systems over two decades later.
The other notable event during the 1990s was the start of today's two major open-source DBMS projects. MySQL was started in Sweden in 1995 based on the earlier ISAM-based mSQL system. PostgreSQL began in 1994 when two Berkeley graduate students forked the original QUEL-based Postgres code from the 1980s to add support for SQL.
The 2000s brought the arrival of Internet applications that had more challenging resource requirements than applications from previous years. They needed to scale to support a large number of concurrent users and had to be on-line all the time. But the database for these new applications was consistently found to be a bottleneck because the resource demands were much greater than what DBMSs and hardware could support at the time. Many tried the most obvious option of scaling their DBMS vertically by moving the database to a machine with better hardware. This, however, only improves performance so much and has diminishing returns. Furthermore, moving the database from one machine to another is a complex process and often requires significant downtime, which is unacceptable for these Web-based applications. To overcome this problem, some companies created custom middleware to shard single-node DBMSs over a cluster of less expensive machines. Such middleware presents a single logical database to the application that is stored across multiple physical nodes. When the application issues queries against this database, the middleware redirects and/or rewrites them to distribute their execution on one or more nodes in the cluster. The nodes execute these queries and send the results back to the middleware, which then coalesces them into a single response to the application. Two notable examples of this middleware approach were eBay's Oracle-based cluster [53] and Google's MySQL-based cluster [54]. This approach was later adopted by Facebook for their own MySQL cluster that is still used today.
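The routing step described above can be sketched in a few lines. This is a hypothetical illustration of hash-based shard routing, not the code any of these companies used; the table, column, and shard count are invented for the example.

```python
# Hypothetical sketch of how sharding middleware routes a single-key
# query: hash the sharding key to pick a node, then rewrite the logical
# query into one that the chosen node can execute locally.

NUM_SHARDS = 4  # illustrative cluster size

def shard_for(key: int) -> int:
    """Map a sharding key to one of the physical nodes by hashing."""
    return hash(key) % NUM_SHARDS

def route_query(customer_id: int) -> tuple[int, str]:
    """Return the target node and the per-node SQL for a key lookup."""
    node = shard_for(customer_id)
    sql = f"SELECT * FROM customer WHERE c_id = {customer_id}"
    return node, sql
```

A single-record read like this touches exactly one node; the harder case, as discussed next, is a query that spans shards.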
Sharding middleware works well for simple operations like reading or updating a single record. It is more difficult, however, to execute queries that update more than one record in a transaction or join tables. As such, these early middleware systems did not support these types of operations. eBay's middleware in 2002, for example, required their developers to implement all join operations in application-level code.

SIGMOD Record, June 2016 (Vol. 45, No. 2)
Eventually some of these companies moved away from using middleware and developed their own distributed DBMSs. The motivation for this was three-fold. Foremost was that traditional DBMSs at that time were focused on consistency and correctness at the expense of availability and performance. But this trade-off was deemed inappropriate for Web-based applications that need to be on-line all the time and have to support a large number of concurrent operations. Secondly, it was thought that there was too much overhead in using a full-featured DBMS like MySQL as a "dumb" data store. Likewise, it was also thought that the relational model was not the best way to represent an application's data and that using SQL was overkill for simple look-up queries.
These problems were the impetus for the NoSQL1 movement in the mid to late 2000s [22]. The key aspect of these NoSQL systems is that they forgo strong transactional guarantees and the relational model of traditional DBMSs in favor of eventual consistency and alternative data models (e.g., key/value, graphs, documents). This is because it was believed that these aspects of existing DBMSs inhibit their ability to scale out and achieve the high availability that is needed to support Web-based applications. The two most well-known systems that first followed this creed are Google's BigTable [23] and Amazon's Dynamo [26]. Neither of these two systems was available outside of its respective company at first (although they are now as cloud services), thus other organizations created their own open-source clones of them. These include Facebook's Cassandra (based on BigTable and Dynamo) and Powerset's HBase (based on BigTable). Other start-ups created their own systems that were not necessarily copies of Google's or Amazon's systems but still followed the tenets of the NoSQL philosophy; the most well-known of these is MongoDB.
By the end of the 2000s, there was a diverse set of scalable and more affordable distributed DBMSs available. The advantage of using a NoSQL system (or so people thought) was that developers could focus on the aspects of their application that were more beneficial to their business or organization, rather than having to worry about how to scale the DBMS. Many applications, however, are unable to use these NoSQL systems because they cannot give up strong transactional and consistency requirements. This is common for enterprise systems that handle high-profile data (e.g., financial and order processing systems). Some organizations, most notably Google [24], have found that NoSQL DBMSs cause their developers to spend too much time writing code to handle inconsistent data, and that using transactions makes them more productive because they provide a useful abstraction that is easier for humans to reason about. Thus, the only options available for these organizations were to either purchase a more powerful single-node machine and scale the DBMS vertically, or to develop their own custom sharding middleware that supports transactions. Both approaches are prohibitively expensive and are therefore not an option for many. It is in this environment that NewSQL systems came about.
1The NoSQL community argues that the sobriquet should now be interpreted as "Not Only SQL", since some of these systems now support some dialect of SQL.
2. THE RISE OF NEWSQL
Our definition of NewSQL is that they are a class of modern relational DBMSs that seek to provide the same scalable performance of NoSQL for OLTP read-write workloads while still maintaining ACID guarantees for transactions. In other words, these systems want to achieve the same scalability of NoSQL DBMSs from the 2000s, but still keep the relational model (with SQL) and transaction support of the legacy DBMSs from the 1970–80s. This enables applications to execute a large number of concurrent transactions to ingest new information and modify the state of the database using SQL (instead of a proprietary API). If an application uses a NewSQL DBMS, then developers do not have to write logic to deal with eventually consistent updates as they would in a NoSQL system. As we discuss below, this interpretation covers a number of both academic and commercial systems.
We note that there are data warehouse DBMSs that came out in the mid-2000s that some people think meet these criteria (e.g., Vertica, Greenplum, Aster Data). These DBMSs target on-line analytical processing (OLAP) workloads and should not be considered NewSQL systems. OLAP DBMSs are focused on executing complex read-only queries (i.e., aggregations, multi-way joins) that take a long time to process large data sets (e.g., seconds or even minutes). Each of these queries can be significantly different than the previous. The applications targeted by NewSQL DBMSs, on the other hand, are characterized as executing read-write transactions that (1) are short-lived (i.e., no user stalls), (2) touch a small subset of data using index lookups (i.e., no full table scans or large distributed joins), and (3) are repetitive (i.e., executing the same queries with different inputs). Others have argued for a more narrow definition where a NewSQL system's implementation has to use (1) a lock-free concurrency control scheme and (2) a shared-nothing distributed architecture [57]. All of the DBMSs that we classify as NewSQL in Section 3 indeed share these properties and thus we agree with this assessment.
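The OLTP workload shape described above can be made concrete with a small sketch. Python's sqlite3 module is used here only as a convenient single-node stand-in to illustrate the pattern (short ACID transactions, primary-key lookups, the same parameterized statement run repeatedly); it is of course not a NewSQL system, and the schema is invented for the example.

```python
import sqlite3

# A toy account table with an integer primary key, so every access
# below is a short index lookup rather than a table scan.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (id INTEGER PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO account VALUES (?, ?)", [(1, 100), (2, 100)])
conn.commit()

def transfer(src: int, dst: int, amount: int) -> None:
    """One short read-write transaction: two key lookups, two updates."""
    with conn:  # commits on success, rolls back on exception
        conn.execute("UPDATE account SET balance = balance - ? WHERE id = ?",
                     (amount, src))
        conn.execute("UPDATE account SET balance = balance + ? WHERE id = ?",
                     (amount, dst))

# The same parameterized transaction is executed over and over with
# different inputs -- the repetitive pattern characteristic of OLTP.
for _ in range(10):
    transfer(1, 2, 1)
```

Contrast this with an OLAP query, which would aggregate or join large portions of the data in a single long-running, read-only statement.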
3. CATEGORIZATION
Given the above definition, we now examine the landscape of today's NewSQL DBMSs. To simplify this analysis, we will group systems based on the salient aspects of their implementation. The three categories that we believe best represent NewSQL systems are (1) novel systems that are built from the ground up using a new architecture, (2) middleware that re-implements the same sharding infrastructure that was developed in the 2000s by Google and others, and (3) database-as-a-service offerings from cloud computing providers that are also based on new architectures.
Both authors have previously included alternative storage engines for existing single-node DBMSs in our categorization of NewSQL systems. The most common examples of these are replacements for MySQL's default InnoDB storage engine (e.g., TokuDB, ScaleDB, Akiban, deepSQL). The advantage of using a new engine is that an organization can get better performance without having to change anything in their application and still leverage the DBMS's existing ecosystem (e.g., tools, APIs). The most interesting of these was ScaleDB because it provided transparent sharding underneath the system without using middleware by redistributing execution between storage engines; the company, however, has since pivoted to another problem domain. There have been other similar extensions for systems other than MySQL. Microsoft's in-memory Hekaton OLTP engine for SQL Server integrates almost seamlessly with the traditional, disk-resident tables. Others use Postgres' foreign data wrappers and API hooks to achieve the same type of integration but target OLAP workloads (e.g., Vitesse, CitusDB).
We now assert that such storage engines and extensions for single-node DBMSs are not representative of NewSQL systems and omit them from our taxonomy. MySQL's InnoDB has improved significantly in terms of reliability and performance, so the benefits of switching to another engine for OLTP applications are not that pronounced. We acknowledge that the benefits from switching from the row-oriented InnoDB engine to a column-store engine for OLAP workloads are more significant (e.g., Infobright, InfiniDB). But in general, the MySQL storage engine replacement business for OLTP workloads is the graveyard of failed database projects.
3.1 New Architectures
This category contains the most interesting NewSQL systems for us because they are new DBMSs built from scratch. That is, rather than extending an existing system (e.g., Microsoft's Hekaton for SQL Server), they are designed from a new codebase without any of the architectural baggage of legacy systems. All of the DBMSs in this category are based on distributed architectures that operate on shared-nothing resources and contain components to support multi-node concurrency control, fault tolerance through replication, flow control, and distributed query processing. The advantage of using a new DBMS that is built for distributed execution is that all parts of the system can be optimized for multi-node environments. This includes things like the query optimizer and the communication protocol between nodes. For example, most NewSQL DBMSs are able to send intra-query data directly between nodes rather than having to route it to a central location as in some middleware systems.
Every one of the DBMSs in this category (with the exception of Google Spanner) also manages its own primary storage, either in memory or on disk. This means that the DBMS is responsible for distributing the database across its resources with a custom engine instead of relying on an off-the-shelf distributed filesystem (e.g., HDFS) or storage fabric (e.g., Apache Ignite). This is an important aspect because it allows the DBMS to "send the query to the data" rather than "bring the data to the query," which results in significantly less network traffic: transmitting a query is typically cheaper than transmitting data (not just tuples, but also indexes and materialized views) to the computation.
Managing their own storage also enables these DBMSs to employ more sophisticated replication schemes than what is possible with the block-based replication scheme used in HDFS. In general, it allows them to achieve better performance than systems that are layered on top of other existing technologies; examples of this include the "SQL on Hadoop" systems like Trafodion [4] and Splice Machine [16] that provide transactions on top of HBase. As such, we believe that such systems should not be considered NewSQL.
But there are downsides to using a DBMS based on a new architecture. Foremost is that many organizations are wary of adopting technologies that are too new and un-vetted, without a large installation base. This means that the number of people that are experienced in the system is much smaller compared to the more popular DBMS vendors. It also means that an organization will potentially lose access to existing administration and reporting tools. Some DBMSs, like Clustrix and MemSQL, avoid this problem by maintaining compatibility with the MySQL wire protocol.
Examples: Clustrix [6], CockroachDB [7], Google Spanner [24], H-Store [8], HyPer [39], MemSQL [11], NuoDB [14], SAP HANA [55], VoltDB [17].
3.2 Transparent Sharding Middleware
There are now products available that provide the same kind of sharding middleware that eBay, Google, Facebook, and other companies developed in the 2000s. These allow an organization to split a database into multiple shards that are stored across a cluster of single-node DBMS instances. Sharding is different than the database federation technologies of the 1990s because each node (1) runs the same DBMS, (2) only has a portion of the overall database, and (3) is not meant to be accessed and updated independently by separate applications.
The centralized middleware component routes queries, coordinates transactions, and manages data placement, replication, and partitioning across the nodes. There is typically a shim layer installed on each DBMS node that communicates with the middleware. This component is responsible for executing queries on behalf of the middleware at its local DBMS instance and returning results. Altogether, these allow middleware products to present a single logical database to the application without needing to modify the underlying DBMS.
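For a query that touches every shard, the middleware follows a scatter-gather pattern: fan the query out, let each node's shim execute it locally, then coalesce the partial results into one answer. The sketch below is a hypothetical illustration of that pattern only; the "shards" are plain in-process dictionaries standing in for single-node DBMS instances, and all names are invented.

```python
# Hypothetical scatter-gather sketch. Each "shard" stands in for a
# single-node DBMS instance holding one portion of a logical table.
shards = [
    {"alice": 10, "bob": 5},   # shard 0
    {"carol": 7},              # shard 1
    {"dave": 3, "erin": 2},    # shard 2
]

def local_count(shard: dict) -> int:
    """What the shim executes at its local DBMS instance."""
    return len(shard)

def global_count() -> int:
    """Middleware: scatter the query to all shards, coalesce the results."""
    return sum(local_count(s) for s in shards)
```

A real middleware would additionally coordinate a distributed transaction when the fanned-out operation performs writes, which is where much of the implementation complexity lies.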
The key advantage of using sharding middleware is that it is often a drop-in replacement for an application that is already using an existing single-node DBMS. Developers do not need to make any changes to their application to use the new sharded database. The most common target for middleware systems is MySQL. This means that in order to be MySQL compatible, the middleware must support the MySQL wire protocol. Oracle provides the MySQL Proxy [13] and Fabric [12] toolkits to do this, but others have written their own protocol handler library to avoid GPL licensing issues.
Although middleware makes it easy for an organization to scale their database out across multiple nodes, such systems still have to use a traditional DBMS on each node (e.g., MySQL, Postgres, Oracle). These DBMSs are based on the disk-oriented architecture that was developed in the 1970s, and thus they cannot use a storage manager or concurrency control scheme that is optimized for memory-oriented storage like some of the NewSQL systems that are built on new architectures. Previous research has shown that the legacy components of disk-oriented architectures are a significant encumbrance that prevents these traditional DBMSs from scaling up to take advantage of higher CPU core counts and larger memory capacities [38]. The middleware approach can also incur redundant query planning and optimization for complex queries (i.e., once at the middleware and once on the individual DBMS nodes), but this does allow each node to apply its own local optimizations to each query.
Examples: AgilData Scalable Cluster2 [1], MariaDB MaxScale [10], ScaleArc [15], ScaleBase3.
2Prior to 2015, AgilData Cluster was known as dbShards.
3ScaleBase was acquired by ScaleArc in 2015 and is no longer sold.

3.3 Database-as-a-Service
Lastly, there are cloud computing providers that offer NewSQL database-as-a-service (DBaaS) products. With these services, organizations do not have to maintain the DBMS on either their own private hardware or on a cloud-hosted virtual machine (VM). Instead, the DBaaS provider is responsible for maintaining the physical config
