Bigtable: A Distributed Storage System for Structured Data
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber
Google, Inc.
To appear in OSDI 2006
Bigtable is a distributed storage system for managing structured data that is designed to scale to a very large size: petabytes of data across thousands of commodity servers. Many projects at Google store data in Bigtable, including web indexing, Google Earth, and Google Finance. These applications place very different demands on Bigtable, both in terms of data size (from URLs to web pages to satellite imagery) and latency requirements (from backend bulk processing to real-time data serving). Despite these varied demands, Bigtable has successfully provided a flexible, high-performance solution for all of these Google products. In this paper we describe the simple data model provided by Bigtable, which gives clients dynamic control over data layout and format, and we describe the design and implementation of Bigtable.
1 Introduction
Over the last two and a half years we have designed, implemented, and deployed a distributed storage system for managing structured data at Google called Bigtable. Bigtable is designed to reliably scale to petabytes of data and thousands of machines. Bigtable has achieved several goals: wide applicability, scalability, high performance, and high availability. Bigtable is used by more than sixty Google products and projects, including Google Analytics, Google Finance, Orkut, Personalized Search, Writely, and Google Earth. These products use Bigtable for a variety of demanding workloads, which range from throughput-oriented batch-processing jobs to latency-sensitive serving of data to end users. The Bigtable clusters used by these products span a wide range of configurations, from a handful to thousands of servers, and store up to several hundred terabytes of data.
In many ways, Bigtable resembles a database: it shares many implementation strategies with databases. Parallel databases [14] and main-memory databases [13] have achieved scalability and high performance, but Bigtable provides a different interface than such systems. Bigtable does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format, and allows clients to reason about the locality properties of the data represented in the underlying storage. Data is indexed using row and column names that can be arbitrary strings. Bigtable also treats data as uninterpreted strings, although clients often serialize various forms of structured and semi-structured data into these strings. Clients can control the locality of their data through careful choices in their schemas. Finally, Bigtable schema parameters let clients dynamically control whether to serve data out of memory or from disk.
Section 2 describes the data model in more detail, and Section 3 provides an overview of the client API. Section 4 briefly describes the underlying Google infrastructure on which Bigtable depends. Section 5 describes the fundamentals of the Bigtable implementation, and Section 6 describes some of the refinements that we made to improve Bigtable’s performance. Section 7 provides measurements of Bigtable’s performance. We describe several examples of how Bigtable is used at Google in Section 8, and discuss some lessons we learned in designing and supporting Bigtable in Section 9. Finally, Section 10 describes related work, and Section 11 presents our conclusions.
2 Data Model
A Bigtable is a sparse, distributed, persistent multi-dimensional sorted map. The map is indexed by a row key, column key, and a timestamp; each value in the map is an uninterpreted array of bytes.
(row:string, column:string, time:int64) → string
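As a rough illustration only (Bigtable is a distributed system and is not implemented this way), the logical shape of this map can be sketched with nested in-memory maps in C++; the innermost map is keyed by timestamp in decreasing order, matching the version ordering described later in this section:

// Purely illustrative sketch of Bigtable's logical data model as nested
// in-memory maps; it only shows the (row, column, timestamp) -> value
// indexing structure, not the actual implementation.
#include <cstdint>
#include <functional>
#include <map>
#include <string>

// Versions of a cell, keyed by timestamp in decreasing order so that the
// most recent version comes first when iterating.
using CellVersions = std::map<int64_t, std::string, std::greater<int64_t>>;

// Columns within a row, keyed by "family:qualifier" column name.
using Row = std::map<std::string, CellVersions>;

// The whole (logical) table: rows sorted lexicographically by row key.
using LogicalTable = std::map<std::string, Row>;

// Example: store one version of the contents: column for a row.
//   LogicalTable t;
//   t["com.cnn.www"]["contents:"][/*timestamp=*/3] = "<html>...";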
Figure 1: A slice of an example table that stores Web pages. The row name is a reversed URL. The contents column family contains the page contents, and the anchor column family contains the text of any anchors that reference the page. CNN’s home page is referenced by both the Sports Illustrated and the MY-look home pages, so the row contains columns named anchor:cnnsi.com and anchor:my.look.ca. Each anchor cell has one version; the contents column has three versions, at timestamps t3, t5, and t6.
We settled on this data model after examining a variety of potential uses of a Bigtable-like system. As one concrete example that drove some of our design decisions, suppose we want to keep a copy of a large collection of web pages and related information that could be used by many different projects; let us call this particular table the Webtable. In Webtable, we would use URLs as row keys, various aspects of web pages as column names, and store the contents of the web pages in the contents: column under the timestamps when they were fetched, as illustrated in Figure 1.
The row keys in a table are arbitrary strings (currently up to 64KB in size, although 10-100 bytes is a typical size for most of our users). Every read or write of data under a single row key is atomic (regardless of the number of different columns being read or written in the row), a design decision that makes it easier for clients to reason about the system’s behavior in the presence of concurrent updates to the same row.
Bigtable maintains data in lexicographic order by row key. The row range for a table is dynamically partitioned. Each row range is called a tablet, which is the unit of distribution and load balancing. As a result, reads of short row ranges are efficient and typically require communication with only a small number of machines. Clients can exploit this property by selecting their row keys so that they get good locality for their data accesses. For example, in Webtable, pages in the same domain are grouped together into contiguous rows by reversing the hostname components of the URLs. For example, we store data for maps.google.com/index.html under the key com.google.maps/index.html. Storing pages from the same domain near each other makes some host and domain analyses more efficient.
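The hostname-reversal scheme is easy to express in client code. The following helper is a hypothetical illustration (ReverseHostnameKey is not part of the Bigtable API); it assumes the input has the form hostname/path and simply reverses the dot-separated hostname components:

#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not part of the Bigtable API): turn a URL of the form
// "hostname/path" into a row key with reversed hostname components, e.g.
// "maps.google.com/index.html" -> "com.google.maps/index.html".
std::string ReverseHostnameKey(const std::string& url) {
  const size_t slash = url.find('/');
  const std::string host = url.substr(0, slash);
  const std::string path = (slash == std::string::npos) ? "" : url.substr(slash);

  // Split the hostname on '.' into its components.
  std::vector<std::string> parts;
  std::stringstream ss(host);
  std::string part;
  while (std::getline(ss, part, '.')) parts.push_back(part);

  // Reassemble the components in reverse order and re-append the path.
  std::string key;
  for (auto it = parts.rbegin(); it != parts.rend(); ++it) {
    if (!key.empty()) key += '.';
    key += *it;
  }
  return key + path;
}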
Column Families
Column keys are grouped into sets called column families, which form the basic unit of access control. All data stored in a column family is usually of the same type (we compress data in the same column family together). A column family must be created before data can be stored under any column key in that family; after a family has been created, any column key within the family can be used. It is our intent that the number of distinct column families in a table be small (in the hundreds at most), and that families rarely change during operation. In contrast, a table may have an unbounded number of columns.
A column key is named using the following syntax: family:qualifier. Column family names must be printable, but qualifiers may be arbitrary strings. An example column family for the Webtable is language, which stores the language in which a web page was written. We use only one column key in the language family, and it stores each web page’s language ID. Another useful column family for this table is anchor; each column key in this family represents a single anchor, as shown in Figure 1. The qualifier is the name of the referring site; the cell contents is the link text.
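A minimal sketch of parsing this naming convention (ColumnKey and ParseColumnKey are our own illustrative names, not part of the Bigtable API): the family is everything before the first ':', and the qualifier is everything after it.

#include <string>

// Hypothetical helper, not part of the Bigtable API: split a column key of
// the form "family:qualifier" into its two parts. Only the first ':'
// separates the family from the qualifier, since qualifiers are arbitrary
// strings and may themselves contain ':'.
struct ColumnKey {
  std::string family;
  std::string qualifier;
};

ColumnKey ParseColumnKey(const std::string& column) {
  const size_t colon = column.find(':');
  if (colon == std::string::npos) {
    return {column, ""};  // Bare family name with no ':' separator.
  }
  return {column.substr(0, colon), column.substr(colon + 1)};
}

// Example: ParseColumnKey("anchor:cnnsi.com") yields
// family = "anchor", qualifier = "cnnsi.com".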
Access control and both disk and memory accounting are performed at the column-family level. In our Webtable example, these controls allow us to manage several different types of applications: some that add new base data, some that read the base data and create derived column families, and some that are only allowed to view existing data (and possibly not even to view all of the existing families for privacy reasons).
Timestamps
Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. Bigtable timestamps are 64-bit integers. They can be assigned by Bigtable, in which case they represent “real time” in microseconds, or be explicitly assigned by client applications.
// Open the table
Table *T = OpenOrDie("/bigtable/web/webtable");

// Write a new anchor and delete an old anchor
RowMutation r1(T, "com.cnn.www");
r1.Set("anchor:www.c-span.org", "CNN");
r1.Delete("anchor:www.abc.com");
Operation op;
Apply(&op, &r1);

Figure 2: Writing to Bigtable.
Scanner scanner(T);
ScanStream *stream;
stream = scanner.FetchColumnFamily("anchor");
stream->SetReturnAllVersions();
scanner.Lookup("com.cnn.www");
for (; !stream->Done(); stream->Next()) {
  printf("%s %s %lld %s\n",
         scanner.RowName(),
         stream->ColumnName(),
         stream->MicroTimestamp(),
         stream->Value());
}

Figure 3: Reading from Bigtable.
Applications that need to avoid collisions must generate unique timestamps themselves. Different versions of a cell are stored in decreasing timestamp order, so that the most recent versions can be read first.
To make the management of versioned data less onerous, we support two per-column-family settings that tell Bigtable to garbage-collect cell versions automatically. The client can specify either that only the last n versions of a cell be kept, or that only new-enough versions be kept (e.g., only keep values that were written in the last seven days).
In our Webtable example, we set the timestamps of the crawled pages stored in the contents: column to the times at which these page versions were actually crawled. The garbage-collection mechanism described above lets us keep only the most recent three versions of every page.
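As a rough sketch of these two settings (GcPolicy, ShouldKeep, and the field names below are ours, for illustration only; they are not the actual Bigtable schema interface), a per-column-family policy might decide which cell versions to retain like this:

#include <cstdint>

// Illustrative only: a column family is configured with one of the two
// garbage-collection settings described above.
struct GcPolicy {
  enum Kind { KEEP_LAST_N, KEEP_NEW_ENOUGH } kind;
  int max_versions;         // used when kind == KEEP_LAST_N
  int64_t max_age_seconds;  // used when kind == KEEP_NEW_ENOUGH
};

// version_rank is 0 for the newest version of the cell, 1 for the next, etc.
// timestamp_us and now_us are microsecond timestamps, as in Bigtable.
bool ShouldKeep(const GcPolicy& policy, int version_rank,
                int64_t timestamp_us, int64_t now_us) {
  switch (policy.kind) {
    case GcPolicy::KEEP_LAST_N:
      // Keep only the newest max_versions versions of the cell.
      return version_rank < policy.max_versions;
    case GcPolicy::KEEP_NEW_ENOUGH:
      // Keep only versions written within the last max_age_seconds.
      return (now_us - timestamp_us) <= policy.max_age_seconds * 1000000LL;
  }
  return true;  // Unreachable; keeps the compiler happy.
}

Under this sketch, the Webtable setting above corresponds to kind = KEEP_LAST_N with max_versions = 3, i.e. keeping only the most recent three crawled versions of each page.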
3 API

The Bigtable API provides functions for creating and deleting tables and column families. It also provides functions for changing cluster, table, and column family metadata, such as access control rights.
Client applications can write or delete values in Bigtable, look up values from individual rows, or iterate over a subset of the data in a table. Figure 2 shows C++ code that uses a RowMutation abstraction to perform a series of updates. (Irrelevant details were elided to keep the example short.) The call to Apply performs an atomic mutation to the Webtable: it adds one anchor to www.cnn.com and deletes a different anchor.
Figure 3 shows C++ code that uses a Scanner abstraction to iterate over all anchors in a particular row. Clients can iterate over multiple column families, and there are several mechanisms for limiting the rows, columns, and timestamps produced by a scan. For example, we could restrict the scan above to only produce anchors whose columns match the regular expression anchor:*.cnn.com, or to only produce anchors whose timestamps fall within ten days of the current time.
Bigtable supports several other features that allow the user to manipulate data in more complex ways. First, Bigtable supports single-row transactions, which can be used to perform atomic read-modify-write sequences on data stored under a single row key. Bigtable does not currently support general transactions across row keys, although it provides an interface for batching writes across row keys at the clients. Second, Bigtable allows cells to be used as integer counters. Finally, Bigtable supports the execution of client-supplied scripts in the address spaces of the servers. The scripts are written in a language developed at Google for processing data called Sawzall [28]. At the moment, our Sawzall-based API does not allow client scripts to write back into Bigtable, but it does allow various forms of data transformation, filtering based on arbitrary expressions, and summarization via a variety of operators.
Bigtable can be used with MapReduce [12], a framework for running large-scale parallel computations developed at Google. We have written a set of wrappers that allow a Bigtable to be used both as an input source and as an output target for MapReduce jobs.
4 Building Blocks
Bigtable is built on several other pieces of Google infrastructure. Bigtable uses the distributed Google File System (GFS) [17] to store log and data files. A Bigtable cluster typically operates in a shared pool of machines that run a wide variety of other distributed applications, and Bigtable processes often share the same machines with processes from other applications. Bigtable depends on a cluster management system for scheduling jobs, managing resources on shared machines, dealing with machine failures, and monitoring machine status.
The Google SSTable file format is used internally to store Bigtable data. An SSTable provides a persistent, ordered immutable map from keys to values, where both keys and values are arbitrary byte strings. Operations are provided to look up the value associated with a specified
key, and to iterate over all key/value pairs in a specified key range. Internally, each SSTable contains a sequence of blocks (typically each block is 64KB in size, but this is configurable). A block index (stored at the end of the SSTable) is used to locate blocks; the index is loaded into memory when the SSTable is opened. A lookup can be performed with a single disk seek: we first find the appropriate block by performing a binary search in the in-memory index, and then read the appropriate block from disk. Optionally, an SSTable can be completely mapped into memory, which allows us to perform lookups and scans without touching disk.
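A simplified sketch of that lookup path (not the actual SSTable implementation; BlockIndexEntry, ReadBlockFromDisk, and the other names below are our own illustrative assumptions): the index is held in memory, a binary search locates the block whose key range could contain the target key, and only that block is read from disk.

#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Illustrative sketch only. One index entry per block, recording the last
// key stored in that block and the block's offset and length in the file.
struct BlockIndexEntry {
  std::string last_key;
  uint64_t offset;
  uint64_t length;
};

// Stub for illustration: a real implementation would seek to `offset`, read
// `length` bytes (one disk seek), decode the block, and scan it for `key`.
std::optional<std::string> ReadBlockFromDisk(uint64_t offset, uint64_t length,
                                             const std::string& key) {
  (void)offset; (void)length; (void)key;
  return std::nullopt;
}

// Lookup with a single disk seek: binary search the in-memory index for the
// first block whose last key is >= the target key, then read just that block.
std::optional<std::string> Lookup(const std::vector<BlockIndexEntry>& index,
                                  const std::string& key) {
  size_t lo = 0, hi = index.size();
  while (lo < hi) {  // find the first entry with last_key >= key
    size_t mid = lo + (hi - lo) / 2;
    if (index[mid].last_key < key) lo = mid + 1; else hi = mid;
  }
  if (lo == index.size()) return std::nullopt;  // key is past the last block
  return ReadBlockFromDisk(index[lo].offset, index[lo].length, key);
}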
Bigtable relies on a highly-available and persistent distributed lock service called Chubby [8]. A Chubby service consists of five active replicas, one of which is elected to be the master and actively serve requests. The service is live when a majority of the replicas are running and can communicate with each other. Chubby uses the Paxos algorithm [9, 23] to keep its replicas consistent in the face of failure. Chubby provides a namespace that consists of directories and small files. Each directory or file can be used as a lock, and reads and writes to a file are atomic. The Chubby client library provides consistent caching of Chubby files. Each Chubby client maintains a session with a Chubby service. A client’s session expires if it is unable to renew its session lease within the lease expiration time. When a client’s session expires, it loses any locks and open handles. Chubby clients can also register callbacks on Chubby files and directories for notification of changes or session expiration.
Bigtable uses Chubby for a variety of tasks: to ensure that there is at most one active master at any time; to store the bootstrap location of Bigtable data (see Section 5.1); to discover tablet servers and finalize tablet server deaths (see Section 5.2); to store Bigtable schema information (the column family information for each table); and to store access control lists. If Chubby becomes unavailable for an extended period of time, Bigtable becomes unavailable. We recently measured this effect in 14 Bigtable clusters spanning 11 Chubby instances. The average percentage of Bigtable server hours during which some data stored in Bigtable was not available due to Chubby unavailability (caused by either Chubby outages or network issues) was 0.0047%. The percentage for the single cluster that was most affected by Chubby unavailability was 0.0326%.
5 Implementation
The Bigtable implementation has three major components: a library that is linked into every client, one master server, and many tablet servers. Tablet servers can be dynamically added to (or removed from) a cluster to accommodate changes in workloads.
The master is responsible for assigning tablets to tablet servers, detecting the addition and expiration of tablet servers, balancing tablet-server load, and garbage collection of files in GFS. In addition, it handles schema changes such as table and column family creations.
Each tablet server manages a set of tablets (typically we have somewhere between ten and a thousand tablets per tablet server). The tablet server handles read and write requests to the tablets that it has loaded, and also splits tablets that have grown too large.
As with many single-master distributed storage systems [17, 21], client data does not move through the master: clients communicate directly with tablet servers for reads and writes. Because Bigtable clients do not rely on the master for tablet location information, most clients never communicate with the master. As a result, the master is lightly loaded in practice.
A Bigtable cluster stores a number of tables. Each table consists of a set of tablets, and each tablet contains all data associated with a row range. Initially, each table consists of just one tablet. As a table grows, it is automatically split into multiple tablets, each approximately 100-200 MB in size by default.
5.1 Tablet Location
We use a three-level hierarchy analogous to that of a B+-tree [10] to store tablet location information (Figure 4).
Figure 4: Tablet location hierarchy. (A Chubby file points to the root tablet, the first METADATA tablet; the root tablet points to the other METADATA tablets; and those point to the user tablets, UserTable1 through UserTableN.)
The first level is a file stored in Chubby that contains the location of the root tablet. The root tablet contains the location of all tablets in a special METADATA table. Each METADATA tablet contains the location of a set of user tablets. The root tablet is just the first tablet in the METADATA table, but is treated specially—it is never split—to ensure that the tablet location hierarchy has no more than three levels.
The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet’s table identifier and its end row. Each METADATA row stores approximately 1KB of data in memory. With a modest limit of 128 MB METADATA tablets, our three-level location scheme is sufficient to address 2^34 tablets (or 2^61 bytes in 128 MB tablets).
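To spell out that capacity arithmetic (this derivation is ours, using only the figures quoted above):

\[
\frac{128\,\text{MB}}{1\,\text{KB per row}} = \frac{2^{27}}{2^{10}} = 2^{17}\ \text{tablet locations per METADATA tablet}
\]
\[
2^{17}\ \text{METADATA tablets} \times 2^{17}\ \text{user tablets each} = 2^{34}\ \text{user tablets}
\]
\[
2^{34}\ \text{tablets} \times 128\,\text{MB} = 2^{34} \times 2^{27}\,\text{bytes} = 2^{61}\ \text{bytes}
\]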
The client library caches tablet locations. If the client does not know the location of a tablet, or if it discovers that cached location information is incorrect, then it recursively moves up the tablet location hierarchy. If the client’s cache is empty, the location algorithm requires three network round-trips, including one read from Chubby. If the client’s cache is stale, the location algorithm could take up to six round-trips, because stale cache entries are only discovered upon misses (assuming that METADATA tablets do not move very frequently). Although tablet locations are stored in memory, so no GFS accesses are required, we further reduce this cost in the common case by having the client library prefetch tablet locations: it reads the metadata for more than one tablet whenever it reads the METADATA table.
We also store secondary information in the METADATA table, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis.
5.2 Tablet Assignment
Each tablet is assigned to one tablet server at a time. The master keeps track of the set of live tablet servers, and the current assignment of tablets to tablet servers, including which tablets are unassigned. When a tablet is unassigned, and a tablet server with sufficient room for the tablet is available, the master assigns the tablet by sending a tablet load request to the tablet server.
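A highly simplified sketch of that assignment step (the types and the SendTabletLoadRequest call below are hypothetical stand-ins, not Bigtable's actual master code):

#include <string>
#include <vector>

// Hypothetical stand-ins for illustration only.
struct TabletServer {
  std::string address;
  int assigned_tablets;
  int capacity;  // how many tablets this server currently has room for
};

struct Tablet {
  std::string table_id;
  std::string end_row;
};

// Stub: a real master would issue a tablet load RPC to server.address here.
void SendTabletLoadRequest(TabletServer& server, const Tablet& tablet) {
  (void)server; (void)tablet;
}

// Assign each unassigned tablet to some live server with room; tablets that
// find no server with room stay unassigned until capacity becomes available.
std::vector<Tablet> AssignUnassignedTablets(
    const std::vector<Tablet>& unassigned,
    std::vector<TabletServer>& live_servers) {
  std::vector<Tablet> still_unassigned;
  for (const Tablet& tablet : unassigned) {
    bool assigned = false;
    for (TabletServer& server : live_servers) {
      if (server.assigned_tablets < server.capacity) {
        SendTabletLoadRequest(server, tablet);  // master sends a load request
        ++server.assigned_tablets;
        assigned = true;
        break;
      }
    }
    if (!assigned) still_unassigned.push_back(tablet);
  }
  return still_unassigned;
}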
Bigtable uses Chubby to keep track of tablet servers. When a tablet server starts, it creates, and acquires an exclusive lock on, a uniquely-named file in a specific Chubby directory. The master monitors this directory (the servers directory) to discover tablet servers.