f4: Facebook’s Warm BLOB Storage System
Subramanian Muralidhar, Facebook, Inc.; , University of Southern California and Facebook, Inc.; Sabyasachi Roy, , , , Satadru Pan, , and , Facebook, Inc.;
Linpeng Tang, Princeton University and Facebook, Inc.; , Facebook, Inc.
https://www.usenix.org/conference/osdi14/technical-sessions/presentation/muralidhar
Copyright By PowCoder代写 加微信 powcoder
This paper is included in the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation. October 6–8, 2014 • Broomfield, CO 978-1-931971-16-4
Open access to the Proceedings of the 11th USENIX Symposium on Operating Systems Design and Implementation
is sponsored by USENIX.
f4: Facebook’s Warm BLOB Storage System
Subramanian Muralidhar∗, †∗, Sabyasachi Roy∗, ∗, ∗, ∗, Satadru Pan∗, ∗, ∗, Linpeng Tang‡∗, ∗
∗Facebook Inc., †University of Southern California, ‡Princeton University
Facebook’s corpus of photos, videos, and other Binary Large OBjects (BLOBs) that need to be reliably stored and quickly accessible is massive and continues to grow. As the footprint of BLOBs increases, storing them in our traditional storage system, Haystack, is becoming in- creasingly inefficient. To increase our storage efficiency, measured in the effective-replication-factor of BLOBs, we examine the underlying access patterns of BLOBs and identify temperature zones that include hot BLOBs that are accessed frequently and warm BLOBs that are accessed far less often. Our overall BLOB storage sys- tem is designed to isolate warm BLOBs and enable us to use a specialized warm BLOB storage system, f4. f4 is a new system that lowers the effective-replication-factor of warm BLOBs while remaining fault tolerant and able to support the lower throughput demands.
f4 currently stores over 65PBs of logical BLOBs and reduces their effective-replication-factor from 3.6 to either 2.8 or 2.1. f4 provides low latency; is resilient to disk, host, rack, and datacenter failures; and provides sufficient throughput for warm BLOBs.
1. Introduction
As Facebook has grown, and the amount of data shared per user has grown, storing data efficiently has become increasingly important. An important class of data that Facebook stores is Binary Large OBjects (BLOBs), which are immutable binary data. BLOBs are created once, read many times, never modified, and sometimes deleted. BLOB types at Facebook include photos, videos, documents, traces, heap dumps, and source code. The storage footprint of BLOBs is large. As of February 2014, Facebook stored over 400 billion photos.
Haystack [5], Facebook’s original BLOB storage system, has been in production for over seven years and is designed for IO-bound workloads. It reduces the number of disk seeks to read a BLOB to almost always one and triple replicates data for fault tolerance and to support a high request rate. However, as Facebook has grown and evolved, the BLOB storage workload has changed. The types of BLOBs stored have increased. The diversity in size and create, read, and delete rates has increased. And, most importantly, there is now a large and increasing number of BLOBs with low request rates. For these BLOBs, triple replication results in over provisioning
from a throughput perspective. Yet, triple replication also provided important fault tolerance guarantees.
Our newer f4 BLOB storage system provides the same fault tolerance guarantees as Haystack but at a lower effective-replication-factor. f4 is simple, modular, scalable, and fault tolerant; it handles the request rate of BLOBs we store it in; it responds to requests with sufficiently low latency; it is tolerant to disk, host, rack and datacenter failures; and it provides all of this at a low effective-replication-factor.
We describe f4 as a warm BLOB storage system because the request rate for its content is lower than that for content in Haystack and thus is not as “hot.” Warm is also in contrast with cold storage systems [20, 40] that reliably store data but may take days or hours to retrieve it, which is unacceptably long for user-facing requests. We also describe BLOBs using temperature, with hot BLOBs receiving many requests and warm BLOBs receiving few.
There is a strong correlation between the age of a BLOB and its temperature, as we will demonstrate. Newly created BLOBs are requested at a far higher rate than older BLOBs. For instance, the request rate for week-old BLOBs is an order of magnitude lower than for less-than-a-day old content for eight of nine examined types. In addition, there is a strong correlation between age and the deletion rate. We use these findings to inform our design: the lower request rate of warm BLOBs en- ables us to provision a lower maximum throughput for f4 than Haystack, and the low delete rate for warm BLOBs enables us to simplify f4 by not needing to physically re- claim space quickly after deletes. We also use our finding to identify warm content using the correlation between age and temperature.
Facebook’s overall BLOB storage architecture is de- signed to enable warm storage. It includes a caching stack that significantly reduces the load on the storage systems and enables them to be provisioned for fewer requests per BLOB; a transformer tier that handles computational- intense BLOB transformation and can be scaled inde- pendently of storage; a router tier that abstracts away the underlying storage systems and enables seamless migration between them; and the hot storage system, Haystack, that aggregates newly created BLOBs into vol- umes and stores them until their request and delete rates have cooled off enough to be migrated to f4.
USENIX Association 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) 383
User Requests (Browsers, Mobile Devices)
Graph Store
BLOB Storage System
Figure 1: Reading (R1-R4) and creating (C1-C3) BLOBs.
f4 stores volumes of warm BLOBs in cells that use dis- tributed erasure coding, which uses fewer physical bytes than triple replication. It uses Reed-Solomon(10,4) [46] coding and lays blocks out on different racks to ensure resilience to disk, machine, and rack failures within a single datacenter. Is uses XOR coding in the wide-area to ensure resilience to datacenter failures. f4 has been running in production at Facebook for over 19 months. f4 currently stores over 65PB of logical data and saves over 53PB of storage.
Our contributions in this paper include:
• A case for warm storage that informs future research on it and justifies our efforts.
• The design of our overall BLOB storage architecture that enables warm storage.
• The design of f4, a simple, efficient, and fault tolerant warm storage solution that reduces our effective- replication-factor from 3.6 to 2.8 and then to 2.1.
• A production evaluation of f4.
The paper continues with background in Section 2. Section 3 presents the case for warm storage. Section 4 presents the design of our overall BLOB storage archi- tecture that enables warm storage. f4 is described in Section 5. Section 6 covers a production evaluation of f4, Section 7 covers lessons learned, Section 8 covers related work, and Section 9 concludes.
2. Background
This section explains where BLOB storage fits in the full architecture of Facebook. It also describes the different types of BLOBs we store and their size distributions.
2.1 Where BLOB Storage Fits
Figure 1 shows how BLOB storage fits into the overall architecture at Facebook. BLOB creates—e.g., a video upload—originate on the web tier (C1). The web tier writes the data to the BLOB storage system (C2) and then stores the handle for that data into our graph store (C3), Tao [9]. The handle can be used to retrieve or delete
Figure 2: Size distribution for five BLOB types.
the BLOB. Tao associates the handle with other elements of the graph, e.g., the owner of a video.
BLOB reads—e.g., watching a video—also originate on the web tier (R1). The web tier accesses the Graph Store (R2) to find the necessary handles and constructs a URL that can be used to fetch the BLOB. When the browser later sends a request for the BLOB (R3), the request first goes to a content distribution network (CDN) [2, 34] that caches commonly accessed BLOBs. If the CDN does not have the requested BLOB, it sends a request to the BLOB storage system (R4), caches the BLOB, and returns it to the user. The CDN shields the storage system from a significant number of requests on frequently accessed data, and we return to its importance in Sections 4.1.
2.2 BLOBs Explained
BLOBs are immutable binary data. They are created once, read potentially many times, and can only be deleted, not modified. This covers many types of content at Facebook. Most BLOB types are user facing, such as photos, videos, and documents. Other BLOB types are internal, such as traces, heap dumps, and source code. User-facing BLOBs are more prevalent so we focus on them for the remainder of the paper and refer to them as simply BLOBs.
Figure 2 shows the distribution of sizes for five types of BLOBs. There is a significant amount of diversity in the sizes of BLOBs, which has implications for our design as discussed in Section 5.6.
3. The Case for Warm Storage
This section motivates the creation of a warm storage system at Facebook. It demonstrates that temperature zones exist, age is a good proxy for temperature, and that warm content is large and growing.
Methodology The data presented in this section is derived from a two-week trace, benchmarks of existing systems, and daily snapshots of summary statistics. The trace includes a random 0.1% of reads, 10% of creates, and 10% of deletes.
384 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’14) USENIX Association