
Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging
We show that all mainstream LSM-tree based key-value stores in the literature and in industry suboptimally trade between the I/O cost of updates on one hand and the I/O cost of lookups and storage space on the other. The reason is that they perform equally expensive merge operations across all levels of LSM-tree to bound the number of runs that a lookup has to probe and to remove obsolete entries to reclaim storage space. With state-of-the-art designs, however, merge operations from all levels of LSM-tree but the largest (i.e., most merge operations) reduce point lookup cost, long range lookup cost, and storage space by a negligible amount while significantly adding to the amortized cost of updates.
To address this problem, we introduce Lazy Leveling, a new design that removes merge operations from all levels of LSM-tree but the largest. Lazy Leveling improves the worst-case complexity of update cost while maintaining the same bounds on point lookup cost, long range lookup cost, and storage space. We further introduce Fluid LSM-tree, a generalization of the entire LSM-tree design space that can be parameterized to assume any existing design. Relative to Lazy Leveling, Fluid LSM-tree can optimize more for updates by merging less at the largest level, or it can optimize more for short range lookups by merging more at all other levels.
We put everything together to design Dostoevsky, a key-value store that adaptively removes superfluous merging by navigating the Fluid LSM-tree design space based on the application workload and hardware. We implemented Dostoevsky on top of RocksDB, and we show that it strictly dominates state-of-the-art designs in terms of performance and storage space.


ACM Reference Format:
, . 2018. Dostoevsky: Better Space-Time Trade-Offs for LSM-Tree Based Key-Value Stores via Adaptive Removal of Superfluous Merging. In Proceedings of the 2018 International Conference on Management of Data (SIGMOD’18). ACM, New York, NY, USA, 16 pages. https://doi.org/10.1145/3183713.3196927
1 INTRODUCTION
Figure 1: Dostoevsky enables richer space-time trade-offs among updates, point lookups, range lookups and space-amplification, and it navigates the design space to find the best trade-off for a particular application.

Key-Value Stores and LSM-Trees. A key-value store is a database that efficiently maps from search keys to their corresponding data values. Key-value stores are used everywhere today from graph processing in social media [8, 17] to event log processing in
cyber security [18] to online transaction processing [27]. To persist key-value entries in storage, most key-value stores today use LSM-tree [41]. LSM-tree buffers inserted/updated entries in main memory and flushes the buffer as a sorted run to secondary storage every time that it fills up. LSM-tree later sort-merges these runs to bound the number of runs that a lookup has to probe and to remove obsolete entries, i.e., entries for which there exists a more recent entry with the same key. LSM-tree organizes runs into levels of exponentially increasing capacities whereby larger levels contain older runs. As entries are updated out-of-place, a point lookup finds the most recent version of an entry by probing the levels from smallest to largest and terminating when it finds the target key. A range lookup, on the other hand, has to access the relevant key range from across all runs at all levels and to eliminate obsolete entries from the result set. To speed up lookups on individual runs, modern designs maintain two additional structures in main memory. First, for every run there is a set of fence pointers that contain the first key of every block of the run; this allows lookups to access a particular key within a run with just one I/O. Second, for every run there exists a Bloom filter; this allows point lookups to skip runs that do not contain the target key. This overall design is adopted in a large number of modern key-value stores including LevelDB [32] and BigTable [19] at Google, RocksDB [29] at Facebook, Cassandra [34], HBase [7] and Accumulo [5] at Apache, Voldemort [38] at LinkedIn, Dynamo [26] at Amazon, WiredTiger [52] at MongoDB, and bLSM [48] and cLSM [31] at Yahoo. Relational databases today such as MySQL (using MyRocks [28]) and SQLite4 support this design too as a storage engine by mapping primary keys to rows as values.

The Problem. The frequency of merge operations in LSM-tree controls an intrinsic trade-off between the I/O cost of updates on one hand and the I/O cost of lookups and storage space-amplification
(i.e., caused by the presence of obsolete entries) on the other. The problem is that existing designs trade suboptimally among these metrics. Figure 1 conceptually depicts this by plotting point lookup cost and space-amplification on the y-axis against update cost on the x-axis (while these y-axis metrics have different units, their trade-off curves with respect to the x-axis have the same shape). The two points at the edges of the curves are a log and a sorted array. LSM-tree degenerates into these edge points when it does not merge at all or when it merges as much as possible, respectively. We place mainstream systems along the top curve between these edge points based on their default merge frequencies, and we draw a superior trade-off curve for Monkey [22], which represents the current state of the art. We show that there exists a trade-off curve superior even to Monkey's. Existing designs forgo a significant amount of performance and/or storage space by not being designed along this bottom curve.
The Problem’s Source. By analyzing the design space of state- of-the-art LSM-trees, we pinpoint the problem to the fact that the worst-case update cost, point lookup cost, range lookup cost, and space-amplification derive differently from across different levels.
• Updates. The I/O cost of an update is paid later through the merge operations that the updated entry participates in. While merge operations at larger levels entail exponentially more work, they take place exponentially less frequently. Therefore, updates derive their I/O cost equally from merge operations across all levels.
• Point lookups. While mainstream designs along the top curve in Figure 1 set the same false positive rate to Bloom filters across all levels of LSM-tree, Monkey, the current state of the art, sets exponentially lower false positive rates to Bloom filters at smaller levels [22]. This is shown to minimize the sum of false positive rates across all filters and to thereby minimize I/O for point lookups. At the same time, this means that access to smaller levels is exponentially less probable. Therefore, most point lookup I/Os target the largest level.
• Long range lookups1. As levels in LSM-tree have exponen- tially increasing capacities, the largest level contains most of the data, and so it tends to contain most of the entries within a given key-range. Therefore, most I/Os issued by long range lookups target the largest level.
• Short range lookups. Range lookups with extremely small key ranges only access approximately one block within each run regardless of the run’s size. As the maximum number of runs per level is fixed in state-of-the-art designs, short range lookups derive their I/O cost equally from across all levels.
• Space-Amplification. The worst-case space-amplification occurs when all entries at smaller levels are updates to entries at the largest level. Therefore, the highest fraction of obsolete entries in the worst-case is at the largest level.
Since the worst-case point lookup cost, long range lookup cost and space-amplification derive mostly from the largest level, merge operations at all levels of LSM-tree but the largest (i.e., most merge operations) hardly improve on these metrics while significantly adding to the amortized cost of updates. This leads to suboptimal trade-offs. We solve this problem from the ground up in three steps.
1In Section 3, we distinguish formally between short and long range lookups.
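As a rough back-of-envelope illustration of these two asymmetries (a toy model only, assuming leveling, where an entry is rewritten roughly T times within each level, and a Monkey-style allocation that lowers the false positive rate by a factor of T at each smaller level):

```python
T, L, B = 10, 5, 128       # size ratio, number of levels, entries per storage block
p_last = 0.01              # false positive rate of the Bloom filter at the largest level

for i in range(1, L + 1):
    # Updates: an entry is re-merged ~T times within each level, and every merge
    # I/O is shared by the B entries in a block, so each level contributes ~T/B.
    update_share = T / B
    # Zero-result point lookups: expected I/Os at Level i equal that level's false
    # positive rate, which the Monkey-style allocation shrinks by ~T per smaller level.
    fpr_i = p_last / T ** (L - i)
    print(f"Level {i}: update-cost share ~{update_share:.3f} I/O, "
          f"expected lookup I/Os ~{fpr_i:.6f}")
# Every level contributes equally to update cost, while nearly all expected
# point-lookup I/Os come from the largest level.
```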
Solution 1: Lazy Leveling to Remove Superfluous Merging.
We expand the LSM-tree design space with Lazy Leveling, a new design that removes merging from all but the largest level of LSM-tree. Lazy Leveling improves the worst-case cost complexity of updates while maintaining the same bounds on point lookup cost, long range lookup cost, and space-amplification and while providing a competitive bound on short range lookup cost. We show that the improved update cost can be traded to reduce point lookup cost and space-amplification. This generates the bottom curve in Figure 1, which offers richer space-time trade-offs that have been impossible to achieve with state-of-the-art designs until now.
Solution 2: Fluid LSM-Tree for Design Space Fluidity. We introduce Fluid LSM-tree as a generalization of LSM-tree that enables transitioning fluidly across the whole LSM-tree design space. Fluid LSM-tree does this by controlling the frequency of merge operations separately for the largest level and for all other levels. Relative to Lazy Leveling, Fluid LSM-tree can optimize more for updates by merging less at the largest level, or it can optimize more for short range lookups by merging more at all other levels.
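As an illustrative sketch of this parameterization, using the K and Z bounds from Table 1 (the concrete settings shown for each named design are indicative, not definitions quoted from the text):

```python
def fluid_lsm(T, K, Z):
    """K bounds the runs at Levels 1..L-1, Z bounds the runs at Level L (see Table 1)."""
    return {"size_ratio": T, "runs_per_smaller_level": K, "runs_at_largest_level": Z}

T = 10
leveling      = fluid_lsm(T, K=1,     Z=1)      # merge eagerly at every level
tiering       = fluid_lsm(T, K=T - 1, Z=T - 1)  # merge lazily at every level
lazy_leveling = fluid_lsm(T, K=T - 1, Z=1)      # merge lazily except at the largest level
# Letting K and Z vary independently between 1 and T-1 makes every intermediate
# design reachable, which is what allows fluid transitions across the space.
```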
Solution 3: Dostoevsky to Navigate the Design Space. We put everything together in Dostoevsky: Space-Time Optimized Evolvable Scalable Key-Value Store. Dostoevsky analytically finds the tuning of Fluid LSM-tree that maximizes throughput for a particular application workload and hardware subject to a user constraint on space-amplification. It does this by pruning the search space to quickly find the best tuning and physically adapting to it during runtime. Since Dostoevsky spans all existing designs and is able to navigate to the best one for a given application, it strictly dominates existing key-value stores in terms of performance and space-amplification. We depict Dostoevsky in Figure 1 as a black star that can navigate the entire design space.
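Schematically (and only schematically; the actual tuning prunes the search space analytically rather than enumerating it), the navigation step can be viewed as picking the Fluid LSM-tree configuration that maximizes modeled throughput subject to the space-amplification constraint. The cost models below are caller-supplied placeholders, not the formulas developed later in the paper:

```python
def navigate(throughput, space_amp, max_space_amp, candidate_T=(2, 4, 8, 16)):
    """Brute-force stand-in for the search over (T, K, Z) configurations."""
    best, best_score = None, float("-inf")
    for T in candidate_T:
        for K in range(1, T):              # bound on runs at Levels 1..L-1
            for Z in range(1, T):          # bound on runs at Level L
                cfg = (T, K, Z)
                if space_amp(cfg) > max_space_amp:
                    continue               # violates the user's space constraint
                score = throughput(cfg)
                if score > best_score:
                    best, best_score = cfg, score
    return best

# Dummy models just to exercise the search; real models depend on workload and hardware.
print(navigate(throughput=lambda c: -sum(c), space_amp=lambda c: 1.0, max_space_amp=2.0))
```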
Contributions. Our contributions are summarized below.
• We show that state-of-the-art LSM-trees perform equally expensive merge operations across all levels of LSM-tree, yet merge operations at all but the largest level (i.e., most merge operations) improve point lookup cost, long range lookup cost, and space-amplification by a negligible amount while adding significantly to the amortized cost of updates.
• We introduce Lazy Leveling to remove merge operations at all but the largest level. This improves the cost complexity of updates while maintaining the same bounds on point lookups, long range lookups, and space-amplification and while providing a competitive bound on short range lookups.
• We introduce Fluid LSM-tree, a generalization of LSM-tree that spans all existing designs. Relative to Lazy Leveling, Fluid LSM-tree can optimize more for updates by merging less at the largest level, or it can optimize more for short range lookups by merging more at all other levels.
• We introduce Dostoevsky, a key-value store that dynamically adapts across the Fluid LSM-tree design space to the design that maximizes the worst-case throughput based on the application workload and the hardware subject to a constraint on space-amplification.
• We implemented Dostoevsky on RocksDB and show that it dominates existing designs for any application scenario.

Term   Definition                                          Unit
N      total number of entries                             entries
L      number of levels                                    levels
Lmax   maximum possible number of levels                   levels
B      number of entries that fit into a storage block     entries
P      size of the buffer in storage blocks                blocks
T      size ratio between adjacent levels
Tlim   size ratio at which L converges to 1
M      main memory allocated to the Bloom filters          bits
pi     false positive rate for Bloom filters at Level i
s      selectivity of a range lookup
R      zero-result point lookup cost                       I/Os
V      non-zero-result point lookup cost                   I/Os
Q      range lookup cost                                   I/Os
W      update cost                                         I/Os
K      bound on number of runs at Levels 1 to L − 1        runs
Z      bound on number of runs at Level L                  runs
μ      storage sequential over random access speed
φ      storage write over read speed

Table 1: Table of terms used throughout the paper.
Figure 2: An overview of an LSM-tree using the tiering and leveling merge policies.
2 BACKGROUND
This section gives the necessary background on LSM-tree based key-value stores. Figure 2 illustrates their architecture and Table 1 gives a list of terms we use throughout the paper.
LSM-Tree Structure. LSM-tree optimizes for write-heavy workloads. This is an important performance goal for systems today because the proportion of application writes is continuously increasing (e.g., in 2012 Yahoo! reported that the proportion of writes targeting their web-service was 50% and projected to continue accelerating [48]). To optimize for writes, LSM-tree initially buffers all updates, insertions, and deletes (henceforth referred to as updates unless otherwise mentioned) in main memory, as shown in Figure 2. When the buffer fills up, LSM-tree flushes the buffer to secondary storage as a sorted run. LSM-tree sort-merges runs in order to (1) bound the number of runs that a lookup has to access in secondary storage, and to (2) remove obsolete entries to reclaim space. It organizes runs into L conceptual levels of exponentially increasing sizes. Level 0 is the buffer in main memory, and runs belonging to all other levels are in secondary storage.
The balance between the I/O cost of merging and the I/O cost of lookups and space-amplification can be tuned using two knobs. The first knob is the size ratio T between the capacities of adjacent levels; T controls the number of levels of LSM-tree and thus the overall number of times that an entry gets merged across levels. The second knob is the merge policy, which controls the number of times an entry gets merged within a level. All designs today use one of two merge policies: tiering or leveling (e.g., Cassandra and RocksDB use tiering and leveling by default, respectively [6, 29]). With tiering, we merge runs within a level only when the level reaches capacity [33]. With leveling, we merge runs within a level whenever a new run comes in [41]. We compare these policies in Figure 2 (with a size ratio of 4 and a buffer size of one entry2). In both cases, the merge is triggered by the buffer flushing and causing Level 1 to reach capacity. With tiering, all runs at Level 1 get merged into a new run that gets placed at Level 2. With leveling, the merge also includes the preexisting run at Level 2. We formally study this design space in the next section. We discuss further details of merge mechanics in Appendix D and other log-structured designs in Appendix E.

2In practice, the buffer size is typically set between 2 and 64 MB [29, 32], but we use a size of one entry in this example for ease of illustration.
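To make the two policies concrete, the following minimal simulation (purely illustrative; it ignores keys and SSTable-granular merging) flushes the buffer repeatedly, applies each policy's merge rule, and counts how much data the merges rewrite:

```python
def simulate(policy, T=4, flushes=64):
    """Each flush produces one run of size 1 (in buffer units); Level i+1 holds
    up to T^(i+1) buffer units. Returns run counts per level and rewritten units."""
    levels, rewritten = [], 0
    for _ in range(flushes):
        incoming, i = 1, 0                     # the flushed buffer enters Level 1
        while True:
            if len(levels) <= i:
                levels.append([])
            runs = levels[i]
            if policy == "tiering":
                runs.append(incoming)
                if len(runs) < T:              # merge only once the level is full
                    break
                incoming = sum(runs)           # merge all T runs into one run...
                rewritten += incoming
                levels[i] = []
                i += 1                         # ...and move it to the next level
            else:                              # leveling: merge into the single run
                merged = incoming + (runs[0] if runs else 0)
                if runs:
                    rewritten += merged        # the existing run is rewritten
                levels[i] = [merged]
                if merged <= T ** (i + 1):     # still fits in Level i+1
                    break
                incoming, levels[i] = merged, []
                i += 1                         # over capacity: push the run down
    return [len(r) for r in levels], rewritten

for policy in ("tiering", "leveling"):
    print(policy, simulate(policy))            # leveling rewrites far more data
```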
Number of Levels. The buffer at Level 0 has a capacity of B · P entries, where B is the number of entries that fit into a storage block, and P is the size of the buffer in terms of storage blocks. In general, Level i has a capacity of B · P · T^i entries, and the capacity at the largest level can be approximated as N · (T − 1)/T entries. To derive the number of levels, we divide the capacity at the largest level by the capacity of the buffer and take log base T of the quotient, as shown in Equation 1.

L = ⌈ log_T ( (N / (B · P)) · ((T − 1) / T) ) ⌉     (1)

We restrict the size ratio to the domain 2 ≤ T ≤ Tlim, where Tlim is defined as N / (B · P). As the size ratio increases and approaches Tlim, the number of levels decreases and approaches 1. Increasing T beyond Tlim has no structural impact. Furthermore, restricting T to be 2 or greater ensures that the resulting run from a merge operation at Level i is never large enough to move beyond Level i + 1. In other words, this ensures that runs do not skip levels. Thus, the highest possible number of levels Lmax is ⌈ log_2 ( (N / (B · P)) · 1/2 ) ⌉ (i.e., when the size ratio is set to 2).
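As a quick sanity check on Equation 1, a few lines of Python compute L for a toy configuration and show how it shrinks toward 1 as the size ratio approaches Tlim:

```python
import math

def num_levels(N, B, P, T):
    """Number of levels per Equation 1: L = ceil(log_T(N/(B*P) * (T-1)/T))."""
    return math.ceil(math.log(N / (B * P) * (T - 1) / T, T))

N, B, P = 2**30, 128, 1024          # ~1B entries, 128 entries/block, 1024-block buffer
T_lim = N / (B * P)                  # size ratio at which L converges to 1
for T in (2, 4, 10, 100, int(T_lim)):
    print(f"T={T:>7}  L={num_levels(N, B, P, T)}")
# L falls from its maximum at T=2 down to 1 as T approaches T_lim.
```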
Finding Entries. Since entries are updated out-of-place, multiple versions of an entry with the same key may exist across multiple levels (and even across runs within a level with tiering). To ensure that a lookup is always able to find the most recent version of an entry, LSM-tree takes the following measures. (1) When an entry is inserted into the buffer and the buffer already contains an entry with the same key, the newer entry replaces the older one. (2) When two runs that contain an entry with the same key are merged, only the entry from the newer run is kept because it is more recent. (3) To be able to infer the order in which different entries with the same
key across different runs were inserted, a run can only be merged with the next older or the next younger run. Overall, these rules
ensure that if there are two runs that contain different versions of the same entry, the younger run contains the newer version.

Point Lookups. A point lookup finds the most recent version of an entry by traversing the levels from smallest to largest, and runs within a level from youngest to oldest with tiering. It terminates when it finds the first entry with a matching key.
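The lookup path can be summarized with a self-contained sketch (a simplification for illustration, not an actual implementation) that probes levels from smallest to largest, skips runs whose filter reports a miss, and uses fence pointers to read at most one block per probed run:

```python
from bisect import bisect_right

class Run:
    """A sorted run split into fixed-size blocks, with fence pointers and a
    (simulated) Bloom filter; all names here are illustrative."""
    def __init__(self, entries, block_size=4):
        items = sorted(entries.items())                        # (key, value) pairs
        self.blocks = [dict(items[i:i + block_size])
                       for i in range(0, len(items), block_size)]
        self.fences = [min(b) for b in self.blocks]            # first key of each block
        self.keys = set(entries)                               # stand-in for a Bloom filter

    def may_contain(self, key):           # a real filter may also return false positives
        return key in self.keys

def point_lookup(levels, key):
    """Probe levels from smallest to largest and runs from youngest to oldest;
    fence pointers locate the single block to read (one I/O per probed run)."""
    for runs in levels:
        for run in runs:
            if not run.may_contain(key):
                continue                                       # skip this run without I/O
            block_id = bisect_right(run.fences, key) - 1
            if block_id >= 0 and key in run.blocks[block_id]:  # the one block read
                return run.blocks[block_id][key]               # most recent version
    return None

levels = [[Run({"k3": "new"})], [Run({"k1": "a", "k3": "old", "k9": "z"})]]
print(point_lookup(levels, "k3"))   # -> 'new' (found at the smaller level first)
```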
Range Lookups. A range lookup has to find the most recent versions of all entries within the target key range. It does this by sort-merging the relevant key range across all runs at all levels. While sort-merging, it identifies entries with the same key across different runs and discards older versions.
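The following sketch (again a simplification, with runs modeled as sorted in-memory lists) shows the sort-merge and the discarding of older versions described above:

```python
import heapq

def range_lookup(levels, lo, hi):
    """`levels` lists levels from smallest to largest and runs from youngest to
    oldest within a level, so a lower `age` means a more recent run."""
    slices = []
    for age, run in enumerate(r for runs in levels for r in runs):
        # Tag each qualifying entry with the run's age so that, after merging,
        # the first version seen for a key is the most recent one.
        slices.append([(k, age, v) for k, v in run if lo <= k < hi])
    result, last_key = [], None
    for k, _, v in heapq.merge(*slices):       # global sort-merge on (key, age)
        if k != last_key:                      # older versions of this key are dropped
            result.append((k, v))
            last_key = k
    return result

levels = [[[("b", "new")]], [[("a", 1), ("b", "old"), ("c", 3)]]]
print(range_lookup(levels, "a", "z"))          # [('a', 1), ('b', 'new'), ('c', 3)]
```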
Deletes. Deletes are supported by adding a one-bit flag to every entry. If a lookup finds that the most recent version of an entry has this flag on, it does not return a value to the application. When a deleted entry is merged with the oldest run, it is discarded as it has replaced all entries with the same key that were inserted prior to it.

Fragmented Merging. To smooth out performance slumps due to long merge operations at larger levels, mainstream designs partition runs into files (e.g., 2 to 64 MB [29, 32]) called Sorted String Tables (SSTables), and they merge one SSTable at a time with the SSTables that have an overlapping key range at the next older run. This technique does not affect the worst-case I/O overhead of merging but only how this overhead gets scheduled across time. For readability throughout the paper, we discuss merge operations as having the granularity of runs, though they can also have the granularity of SSTables.

Space-Amplification. The factor by which the presence of obsolete entries amplifies storage space is known as space-amplification.
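To make the worst case from the introduction concrete, a short back-of-envelope calculation (an illustration assuming leveling with one run per level, not the analysis developed later): the largest level holds approximately N · (T − 1)/T entries and all smaller levels together hold approximately N/T. If every entry at the smaller levels updates a distinct entry at the largest level, then N/T of the stored entries are obsolete, and the ratio of obsolete to unique entries is

(N/T) / (N · (T − 1)/T) = 1/(T − 1),

so a larger size ratio T shrinks this worst case.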
