MIT CSAIL / Intel Labs
Intel Labs
Alexander van Renen, TUM
Alfons, TUM
Recent advancements in learned index structures propose replacing existing index structures, like B-Trees, with approximate learned models. In this work, we present a unified benchmark that compares well-tuned implementations of three learned index structures against several state-of-the-art “traditional” baselines. Using four real-world datasets, we demonstrate that learned index structures can indeed outperform non-learned indexes in read-only in-memory workloads over a dense array. We investigate the impact of caching, pipelining, dataset size, and key size. We study the performance profile of learned index structures, and build an explanation for why learned models achieve such good performance. Finally, we investigate other important properties of learned index structures, such as their performance in multi-threaded systems and their build times.
PVLDB Reference Format:
Alexander van Renen, et al. Benchmarking Learned Indexes. PVLDB, 14(1): 1–13, 2021. doi:10.14778/3421424.3421425
PVLDB Artifact Availability:
The source code, data, and/or other artifacts have been made available at https://learned.systems/sosd.
1 INTRODUCTION
While index structures are one of the most well-studied components of database management systems, recent work [11, 18] provided a new perspective on this decades-old topic, showing how machine learning techniques can be used to develop so-called learned index structures. Unlike their traditional counterparts (e.g., [9, 14, 15, 19, 30, 32]), learned index structures build an explicit model of the underlying data to provide effective indexing.
Ever since learned index structures were proposed, they have drawn criticism [25, 26]. These criticisms were motivated by the lack of an efficient open-source implementation of the learned index structure, inadequate datasets, and the lack of a standardized benchmark suite to ensure a fair comparison between the different approaches.
This work is licensed under the Creative Commons BY-NC-ND 4.0 International License. Visit https://creativecommons.org/licenses/by-nc-nd/4.0/ to view a copy of this license. For any use beyond those covered by this license, obtain permission by emailing the copyright holder. Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment.
Proceedings of the VLDB Endowment, Vol. 14, No. 1 ISSN 2150-8097. doi:10.14778/3421424.3421425
Even worse, the lack of an open-source implementation forced researchers to re-implement the techniques of [18], or use back-of-the-envelope calculations, to compare against learned index structures. While not a bad thing per se, it is easy to leave the baseline unoptimized, or make other unrealistic assumptions, even with the best of intentions, potentially rendering the main takeaways void.
For example, recently Ferragina and Vinciguerra proposed the PGM index [12], a learned index structure with interesting theoretical properties, which is recursively built bottom-up. Their experimental evaluation showed that the PGM index was strictly superior to traditional indexes as well as their own implementation of the original learned index [18]. This strong result surprised the authors of [18], who had experimented with bottom-up approaches and found them to be slower (see Section 3.4 for a discussion of why this may be the case). This motivated us to investigate whether the results of [12] would hold against tuned implementations of [18] and other structures.
Further complicating matters, learned structures have an “unfair” advantage on synthetic datasets, as synthetic datasets are often surprisingly easy to learn. Hence, it is often easy to show that a learned structure outperforms the traditional approaches just by using the right data. While this is true for almost any benchmark, it is more pronounced for learned algorithms and data structures, as their entire goal is to automatically adjust to the data distribution / workload.
In this paper, we try to address these problems on three fronts: (1) we provide a first open-source implementation of RMIs for researchers to compare against and improve upon, (2) we created a repository of several real-world datasets and workloads for testing, and (3) we created a benchmarking suite, which makes it easy to compare against learned and traditional index structures. To avoid comparing against weak baselines, our open-source benchmarking suite [4] contains implementations of index structures that are either widespread, tuned by their original authors, or both.
Understanding learned indexes. In addition to providing an open source benchmark for use in future research, we also tried to achieve a deeper understanding of learned index structures, extending the work of [16]. First, we present a Pareto analysis of three recent learned index structures (RMIs [18], PGM [12], and RS [17]) and
Figure 1: Index structures map each lookup key to a search bound. This search bound must contain the “lower bound” of the key (i.e., the smallest key ≥ the lookup key). The depicted search bound is valid for the lookup key 72 because the key 95 is in the bound. A search function, such as binary search, is used to locate the correct index within the search bound.
several traditional index structures, including trees, tries, and hash tables. We show that, in a warm-cache, tight-loop setting, all three variants of learned index structures can provide better performance/size tradeoffs than their traditional counterparts. We extend this analysis to multiple dataset sizes, 32- and 64-bit integers, and different search techniques (i.e., binary, linear, interpolation).
Second, we analyze why learned index structures achieve such good performance. While we were unable to find a single metric that fully explains the performance of an index structure (it seems intuitive that such a metric does not exist), we offer a statistical analysis of performance counters and other properties. The single most important explanatory variable was cache misses, although cache misses alone are not enough for a statistically significant explanation. Surprisingly, we found that branch misses do not explain why learned index structures perform better than traditional structures, as originally claimed in [18]. In fact, we found that both learned index structures and traditional index structures use branching efficiently.
Third, we analyze the performance of index structures in the presence of memory fences, cold caches, and multi-threaded environments, to test behavior under more realistic settings. In all scenarios, we found that learned approaches perform well.
However, our study is not without its limitations. We focused only on read-only workloads, and we tested each index structure in isolation (e.g., a lookup loop, not with integration into any broader application). While this certainly does not cover all potential use cases, in-memory performance is increasingly important, and many write-heavy DBMS architectures are also moving towards immutable read-only data structures (for example, see LSM-trees in RocksDB [3, 20]). Hence, we believe our benchmark can serve as a foundation for mixed read/write workloads and the next generation of learned index structures that support writes [10, 12, 13].
2 FORMULATION & DEFINITIONS
As depicted in Figure 1, we define an index structure 𝐼 over a zero-indexed sorted array 𝐷 as a mapping between an integer lookup key 𝑥 ∈ Z and a search bound (𝑙𝑜, ℎ𝑖) ∈ (Z+ × Z+), where Z+ denotes the non-negative integers:
𝐼 : Z → (Z+ × Z+)
Figure 2: The cumulative distribution function (CDF) view of a sorted array.
We consider only sorted data and integer keys. We assume that data is stored in a way supporting fast random access (e.g., an array). Search bounds are indexes into 𝐷. A valid index structure maps any possible lookup key 𝑥 to a bound that contains the “lower bound” of 𝑥 : the smallest key in 𝐷 that is greater than or equal to 𝑥 . Formally,
we define the lower bound of a key 𝑥, 𝐿𝐵(𝑥), as:
𝐿𝐵(𝑥) = 𝑖 ↔ [𝐷𝑖 ≥ 𝑥 ∧ ¬∃𝑗 (𝑗 < 𝑖 ∧ 𝐷𝑗 ≥ 𝑥)]
As a special case, we define the lower bound of any key greater than the largest key in 𝐷 as the size of 𝐷 (one past the last valid index): 𝐿𝐵(𝑥) = |𝐷| for 𝑥 > max 𝐷. Our definition of “lower bound” corresponds to the C++ standard [1]. We say that an index structure is valid if and only if it produces search bounds that contain the lower bound for every possible lookup key:
∀𝑥 ∈ Z [𝐼(𝑥) = (𝑙𝑜, ℎ𝑖) → 𝑙𝑜 ≤ 𝐿𝐵(𝑥) ≤ ℎ𝑖]
Intuitively, this view of index structures corresponds to an approximate index, an index that returns a search range instead of the exact position of a key [7, 18]. Given a valid index, the actual index of the lower bound for a lookup key is located via a “last mile” search (e.g., binary search). This last mile search only needs to examine the keys within the provided search bound (e.g., Figure 1).
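The “last mile” step can be illustrated with a short sketch (Python here for brevity; `bisect_left` matches the semantics of C++ `std::lower_bound`). The data and the bound (5, 9) for lookup key 72 follow the example of Figure 1; the bound is an assumed output of some valid index structure:

```python
from bisect import bisect_left

def last_mile_search(data, key, lo, hi):
    """Binary-search only within the search bound [lo, hi] to find the
    lower bound of `key` (the first index i with data[i] >= key)."""
    return bisect_left(data, key, lo, min(hi + 1, len(data)))

# Example data from Figure 1; a valid index maps lookup key 72 to a
# bound such as (5, 9), which contains index 7 (key 95, the lower bound).
data = [1, 3, 9, 12, 56, 57, 58, 95, 98, 99]
```

Note that a lookup key larger than all keys yields |𝐷|, matching the special case above.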
2.1 Approximating the CDF
Learned index structures use machine learning techniques ranging from deep neural networks to simple regression in order to model the cumulative distribution function, or CDF, of a sorted array [18]. Here, we use the term CDF to mean the function mapping keys to their relative position in an array. This is strongly connected to the traditional interpretation of the CDF from statistics: the CDF of a particular key 𝑥 is the proportion of keys less than 𝑥 . Figure 2 shows the CDF for some example data.
Given the CDF of a dataset, finding the lower bound of a lookup key 𝑥 in a dataset 𝐷 with a CDF 𝐶𝐷𝐹𝐷 is trivial: one simply computes 𝐶𝐷𝐹𝐷(𝑥) × |𝐷|. Learned index structures function by approximating the CDF of the dataset using learned models (e.g., linear regressions). Of course, such learned models are never entirely accurate. For example, the blue line in Figure 2 represents one possible imperfect approximation of the CDF. While imperfect, this approximation has a bounded error: the largest deviation from the blue line to the actual CDF occurs at key 12, which has a true CDF value
Figure 3: A recursive model index (RMI). The linear model (stage 1) makes a coarse-grained prediction. Based on this, one of the cubic models (stage 2) makes a refined prediction.
of 0.4 but an approximated value of 0.24. The maximum error of this approximation is thus 0.4 − 0.24 = 0.16 (some adjustments may be required for lookups of absent keys). Given this approximation function 𝐴 and the maximum error of 𝐴, we can define an index structure 𝐼𝐴 as such:
𝐼𝐴(𝑥) = (𝐴(𝑥) − |𝐷| × 0.16,𝐴(𝑥) + |𝐷| × 0.16)
Note that this technique, while utilizing approximate machine learning techniques, never produces an incorrect search bound.
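To make this concrete, here is a minimal sketch (our own illustrative code, not from [18]): a deliberately crude linear CDF approximation whose measured maximum error is used to form search bounds, exactly as in the definition of 𝐼𝐴 above. The bounds are valid for all keys present in the data:

```python
from bisect import bisect_left

data = [1, 3, 9, 12, 56, 57, 58, 95, 98, 99]
n = len(data)

def approx_cdf(x):
    # Deliberately crude linear approximation of the CDF (not fitted).
    return min(max(x / 100.0, 0.0), 1.0)

# Worst-case deviation of the approximation over the data, measured
# here instead of assumed (the analogue of the 0.16 running example).
max_err = max(abs(approx_cdf(k) - i / n) for i, k in enumerate(data))

def lookup(x):
    """I_A(x): widen the prediction by the maximum error, then do the
    last-mile binary search inside the resulting bound."""
    pos = approx_cdf(x) * n
    lo = max(0, int(pos - max_err * n))
    hi = min(n, int(pos + max_err * n) + 1)
    return bisect_left(data, x, lo, hi)
```

Because the bound always covers the true position, the final binary search can never miss a present key.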
One can view a B-Tree as a way of memorizing the CDF function for a given dataset: a B-Tree in which every 𝑛th key is inserted can be viewed as an approximate index with an error bound of 𝑛 − 1. At one extreme, if every key is inserted into the B-Tree, the B-Tree perfectly maps any possible lookup key to its position in the underlying data (an error bound of zero). Instead, one can insert every other key into a B-Tree in order to reduce the size of the index. This results in a B-Tree with an error bound of one: any location given by the B-Tree can be off by at most one position.
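This sparse view of a B-Tree can be sketched directly. Here, a plain sorted array of every 𝑛th key (a hypothetical `sparse_lookup` helper, standing in for an actual B-Tree) yields an approximate index with error bound 𝑛 − 1:

```python
from bisect import bisect_left, bisect_right

data = [1, 3, 9, 12, 56, 57, 58, 95, 98, 99]
n = 2                 # keep every n-th key; error bound is n - 1
sparse = data[::n]    # [1, 9, 56, 58, 98]

def sparse_lookup(key):
    """Use the sparse index to get a position within n - 1 of the lower
    bound, then finish with a bounded binary search."""
    j = max(0, bisect_right(sparse, key) - 1)  # last sparse key <= key
    lo = j * n                                 # its position in `data`
    hi = min(len(data), lo + n)
    return bisect_left(data, key, lo, hi)
```

Halving the sparse array again would double the error bound while halving the index size, which is exactly the size/accuracy tradeoff described above.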
3 LEARNED INDEX STRUCTURES
In this work, we evaluate the performance of three different learned index structures: recursive model indexes (RMI), radix spline indexes (RS), and piecewise geometric model indexes (PGM). We do not compare with a number of other learned index structures [10, 13, 23] because tuned implementations could not be made publicly available. While all three of these techniques approximate the CDF of the underlying data, the way these approximations are constructed varies. We next give a high-level overview of each technique, followed by a discussion of their differences.
3.1 Recursive model indexes (RMI)
Originally presented by Kraska et al. [18], RMIs use a multi-stage model, combining simpler machine learning models together. For example, as depicted in Figure 3, an RMI with two stages, a linear stage and a cubic stage, would first use a linear model to make an initial prediction of the CDF for a specific key (stage 1). Then, based on that prediction, the RMI would select one of several cubic models to refine this initial prediction (stage 2).
Structure. When all keys can fit in memory, RMIs with more than two stages are almost never required [21]. Thus, here we explain only two-stage RMIs for simplicity. See [18] for a generalization to 𝑛 stages. A two-stage RMI is a CDF approximator 𝐴 trained on |𝐷 | data points (key / index pairs). The RMI approximator 𝐴 is composed of a single first stage model 𝑓1, and 𝐵 second-stage models 𝑓2𝑖. The value 𝐵 is referred to as the “branching factor” of the RMI.
Figure 4: A radix spline index. A linear spline is used to approx- imate the CDF of the data. Prefixes of resulting spline points are indexed in a radix table to accelerate the search on the spline.
Formally, the RMI is defined as:
𝐴(𝑥) = 𝑓2𝑖(𝑥) with 𝑖 = ⌊𝐵 × 𝑓1(𝑥) / |𝐷|⌋   (1)
Intuitively, the RMI first uses the stage-one model 𝑓1(𝑥) to compute a rough approximation of the CDF of the input key 𝑥. This coarse-grained approximation is then scaled between 0 and the branching factor 𝐵, and this scaled value is used to select a model from the second stage, 𝑓2𝑖(𝑥). The selected second-stage model is used to produce the final approximation. The stage-one model 𝑓1(𝑥) can be thought of as partitioning the data into 𝐵 buckets, and each second-stage model 𝑓2𝑖(𝑥) is responsible for approximating the CDF of only the keys that fall into the 𝑖th bucket.
Choosing the correct models for both stages (𝑓1 and 𝑓2) and selecting the best branching factor for a particular dataset depends on the desired memory footprint of the RMI as well as the underlying data. In this work, we use the CDFShop RMI auto-tuner [21].
Training. Let (𝑥,𝑦) ∈ 𝐷 be the set of key / index pairs in the underlying data. Then, an RMI is trained by adjusting the parameters contained in 𝑓1 (𝑥 ) and 𝑓2𝑖 (𝑥 ) to minimize:
∑_{(𝑥,𝑦)∈𝐷} (𝐴(𝑥) − 𝑦)²   (2)
Intuitively, minimizing Equation 2 is done by training “top down”: first, the stage one model is trained, and then each stage 2 model is trained to fine-tune the prediction. Details can be found in [18].
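The following sketch shows a two-stage RMI in this spirit, with least-squares linear models at both stages, trained top-down. This is a simplification: real RMIs mix model types and tune the branching factor per dataset [21], and all names here are illustrative:

```python
class TwoStageRMI:
    """Illustrative two-stage RMI: a linear root model routes each key to
    one of B linear leaf models, per Eq. (1); trained top-down, per Eq. (2)."""

    def __init__(self, keys, branching=4):
        self.n = len(keys)
        self.B = branching
        pts = [(k, i) for i, k in enumerate(keys)]
        self.root = self._fit(pts)                      # stage-one model f1
        buckets = [[] for _ in range(branching)]
        for k, i in pts:                                # route keys by f1
            buckets[self._bucket(k)].append((k, i))
        self.leaves = [self._fit(b) for b in buckets]   # stage-two models f2i

    @staticmethod
    def _fit(pts):
        # Least-squares line i ~= a*k + b (constant model for tiny buckets).
        if len(pts) < 2:
            return (0.0, float(pts[0][1]) if pts else 0.0)
        n = len(pts)
        sx = sum(k for k, _ in pts); sy = sum(i for _, i in pts)
        sxx = sum(k * k for k, _ in pts); sxy = sum(k * i for k, i in pts)
        denom = n * sxx - sx * sx
        a = (n * sxy - sx * sy) / denom if denom else 0.0
        return (a, (sy - a * sx) / n)

    def _bucket(self, x):
        a, b = self.root
        return min(self.B - 1, max(0, int(self.B * (a * x + b) / self.n)))

    def predict(self, x):
        a, b = self.leaves[self._bucket(x)]             # Eq. (1)
        return a * x + b

rmi = TwoStageRMI(list(range(0, 1000, 10)))  # 100 evenly spaced keys
```

As with any CDF approximation, a production RMI would additionally record its maximum error per leaf to form valid search bounds.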
3.2 Radix spline indexes (RS)
An RS index [17] consists of a linear spline [24] that approximates the CDF of the data and a radix table that indexes spline points (Figure 4). RS is built in a bottom-up fashion. Uniquely, RS can be built in a single pass with a constant worst-case cost per element (PGM provides a constant amortized cost).
Structure. As depicted in Figure 4, RS consists of a radix table and a set of spline points that define a linear spline over the CDF of the data. The radix table indexes 𝑟 -bit prefixes of the spline points and serves as an approximate index over the spline points. Its purpose is to accelerate binary searches over the spline points. The radix table is represented as an array containing 2𝑟 offsets into the sorted array of spline points. The spline points themselves are represented as key / index pairs. To locate a key in a spline segment, linear interpolation between the two spline points is used.
Data 1 3 9 12 56 57 58 95 98 99
PGM Level 1
Key: 1 Model: f1
Key: 56 Model: f2 56
Key: 95 Model: f3
PGM Level 2
Key: 1 Model: f4
Key: 56 Model: f5
Figure 5: A piecewise geometric model (PGM) index.
Using the example in Figure 4, a lookup in RS works as follows: First, the 𝑟 most significant bits 𝑏 of the lookup key are extracted (𝑟 = 3 and 𝑏 = 101). Then, the extracted bits 𝑏 are used as an offset into the radix table to retrieve the offsets stored at the 𝑏th and the 𝑏 + 1th position (e.g., the 5th and the 6th position). Next, RS performs a binary search between the two offsets on the sorted array of spline points to locate the two spline points that encompass the lookup key. Once the relevant spline segment has been identified, it uses linear interpolation between the two spline points to estimate the position of the lookup key in the underlying data.
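The lookup just described can be sketched as follows, assuming toy 8-bit keys, 𝑟 = 3 radix bits, and a prebuilt set of spline points (the spline values and all names are illustrative, not taken from [17]):

```python
from bisect import bisect_left

R_BITS, KEY_BITS = 3, 8        # toy sizes; RS targets 64-bit keys
SHIFT = KEY_BITS - R_BITS

# Hypothetical spline points (key, CDF position) over some sorted data.
spline = [(0, 0), (64, 10), (160, 22), (255, 31)]

def build_radix_table(spline):
    """table[b] = index of the first spline point whose r-bit prefix >= b."""
    prefixes = [k >> SHIFT for k, _ in spline]
    return [bisect_left(prefixes, b) for b in range(2 ** R_BITS + 1)]

table = build_radix_table(spline)

def rs_estimate(key):
    """Radix-table lookup, binary search on the narrowed spline range,
    then linear interpolation between the enclosing spline points."""
    b = key >> SHIFT                       # r most significant bits
    lo, hi = table[b], table[b + 1]        # narrowed spline search range
    keys = [k for k, _ in spline]
    j = bisect_left(keys, key, lo, min(hi + 1, len(spline)))
    if j == 0:
        return float(spline[0][1])
    k0, p0 = spline[j - 1]
    k1, p1 = spline[min(j, len(spline) - 1)]
    if k1 == k0:
        return float(p0)
    return p0 + (key - k0) * (p1 - p0) / (k1 - k0)
```

The radix table only narrows the binary search over spline points; the spline's error bound guarantees the estimate is within the configured error of the true position.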
Training. To build the spline layer, RS uses a one-pass spline fitting algorithm [24] that is similar to the shrinking cone algorithm of FITing-Tree [13]. The spline algorithm guarantees a user-defined error bound. At a high level, whenever the current error corridor exceeds the user-supplied bound, a new spline point is created. Whenever the spline algorithm encounters a new 𝑟-bit prefix, a new entry is inserted into the pre-allocated radix table.
RS has only two hyperparameters (spline error and number of radix bits), which makes it straightforward to tune. In practice, few configurations need to be tested to reach a desired performance / size tradeoff on a given dataset [17].
3.3 Piecewise geometric model indexes (PGM)
The PGM index is a multi-level structure, where each level represents an error-bounded piecewise linear regression [12]. An example PGM index is depicted in Figure 5. In the first level, the data is partitioned into three segments, each represented by a simple linear model (𝑓1, 𝑓2, 𝑓3). By construction, each of these linear models predicts the CDF of keys in their corresponding segments to within a preset error bound. The partition boundaries of this first level are then treated as their own sorted dataset, and another error-bounded piecewise linear regression is computed. This is repeated until the top level of the PGM is sufficiently small.
Structure. A piecewise linear regression partitions the data into 𝑛 + 1 segments with a set of points 𝑝0,𝑝1,...,𝑝𝑛. The entire piecewise linear regression is expressed as a piecewise function:
𝑓(𝑥) =
  𝑎0 × 𝑥 + 𝑏0   if 𝑥 < 𝑝0
  𝑎1 × 𝑥 + 𝑏1   if 𝑥 ≥ 𝑝0 and 𝑥 < 𝑝1
  ⋯
Each regression in the PGM index is constructed with a fixed error bound 𝜖. Such a regression can trivially be used as an approximate index. PGM indexes apply this trick recursively, first building an error-bounded piecewise regression model over the underlying data, then building another error-bounded piecewise regression model over the partitioning points of the first regression. Key lookups are performed by searching each index layer until the regression over the underlying data is reached.
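A single level of this construction can be sketched as follows. Note the segmentation here is a naive fixed-width split rather than the optimal algorithm of [31], so we measure the resulting error and use it as the bound 𝜖; all names are illustrative:

```python
import math
from bisect import bisect_left, bisect_right

data = [1, 3, 9, 12, 56, 57, 58, 95, 98, 99]

def build_segments(data, width=4):
    """Naive piecewise linear model: one segment (first_key, slope,
    intercept) per `width` consecutive entries."""
    segs = []
    for s in range(0, len(data), width):
        e = min(s + width - 1, len(data) - 1)
        k0, k1 = data[s], data[e]
        slope = (e - s) / (k1 - k0) if k1 != k0 else 0.0
        segs.append((k0, slope, s))
    return segs

segs = build_segments(data)

def predict(key):
    """Evaluate the piecewise function: pick the segment by first_key,
    then apply its linear model a * (x - first_key) + b."""
    j = max(0, bisect_right([k for k, _, _ in segs], key) - 1)
    k0, a, b = segs[j]
    return a * (key - k0) + b

# Measured maximum error of the model; plays the role of epsilon.
eps = math.ceil(max(abs(predict(k) - i) for i, k in enumerate(data)))

def pgm_level_lookup(key):
    pos = int(round(predict(key)))
    lo, hi = max(0, pos - eps), min(len(data), pos + eps + 1)
    return bisect_left(data, key, lo, hi)
```

A full PGM index would recurse: the segment boundary keys form a new, much smaller sorted dataset, over which the same construction is repeated.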
Training. Each regression is constructed optimally, in the sense that the fewest pieces are used to achieve a preset maximum error. This can be done quickly using the approach of [31]. The first regression is performed on the underlying data, resulting in a set of split points (the boundaries of each piece of the regression) and regression coefficients. These split points are then treated as if they were a new dataset, and the process is repeated, resulting in fewer and fewer pieces at each level. Since each piecewise linear regression contains the few