FAST: Fast Architecture Sensitive Tree Search on Modern CPUs and GPUs
Changkyu Kim†, Jatin Chhugani†, Nadathur Satish†, Eric Sedlar⋆, Anthony D. Nguyen†, Tim Kaldewey⋆, Victor W. Lee†, Scott A. Brandt⋄, and Pradeep Dubey†
†Throughput Computing Lab, Intel Corporation
In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous computing power by integrating multiple cores, each with wide vector units. There has been much work to exploit modern processor architectures for database primitives like scan, sort, join and aggregation. However, unlike other primitives, tree search presents significant challenges due to irregular and unpredictable data accesses in tree traversal.
In this paper, we present FAST, an extremely fast architecture sensitive layout of the index tree. FAST is a binary tree logically organized to optimize for architecture features like page size, cache line size, and SIMD width of the underlying hardware. FAST eliminates the impact of memory latency, and exploits thread-level and data-level parallelism on both CPUs and GPUs to achieve 50 million (CPU) and 85 million (GPU) queries per second, 5X (CPU) and 1.7X (GPU) faster than the best previously reported performance on the same architectures. FAST supports efficient bulk updates by rebuilding index trees in less than 0.1 seconds for datasets as large as 64M keys, and naturally integrates compression techniques, overcoming the memory bandwidth bottleneck and achieving a 6X performance improvement over uncompressed index search for large keys on CPUs.
Categories and Subject Descriptors
H.2 [Database Management]: Systems
General Terms
Performance, Algorithms
1. INTRODUCTION
Tree structured index search is a critical database primitive, used in a wide range of applications. In today’s data warehouse systems, many data processing tasks, such as scientific data mining, network monitoring, and financial analysis, require handling large volumes of index searches with low latency and high throughput.
As memory capacity has increased dramatically over the years, many database tables now reside completely in memory, thus eliminating disk I/O operations.
Modern processors integrate multiple cores on a chip, each with wide vector (SIMD) units. Although memory bandwidth has also been increasing steadily, the bandwidth-to-compute ratio is decreasing, and memory bandwidth will eventually become the bottleneck for future scalable performance.
In the database community, there is growing interest in exploiting the increased compute in modern processors. Recently, researchers have explored speeding up critical primitives like scan, sort, join and aggregation [30, 11, 22, 12]. However, unlike these primitives, index tree search presents significant challenges in utilizing the high compute and bandwidth resources. Database search typically involves a long-latency main memory access followed by a small number of arithmetic operations, leading to ineffective utilization of the large number of cores and wide SIMD units. This main memory access latency is difficult to hide due to irregular and unpredictable data accesses during the tree traversal.
In this paper, we present the FAST (Fast Architecture Sensitive Tree) search algorithm, which exploits the high compute in modern processors for index tree traversal. FAST is a binary tree, managed as a hierarchical tree whose elements are rearranged based on architecture features like page size, cache line size, and SIMD width of the underlying hardware. We show how to eliminate the impact of latency with a hierarchically blocked tree, software pipelining, and prefetches.
Having eliminated the memory latency impact, we show how to extract parallelism to achieve high throughput search on two high performance commodity architectures – CPUs and GPUs. We report the fastest search performance on both platforms by utilizing many cores and wide SIMD units. Our CPU search performance on the Core i7 with 64M¹ 32-bit (key, rid) pairs is 5X faster than the best reported number, achieving a throughput of 50M search queries per second. We achieve peak throughput even within a stringent response time constraint of 1 microsecond. Our GPU search on the GTX 280 is around 1.7X faster than the best reported number, achieving 85M search queries per second. Our high throughput search is naturally applicable to look-up intensive applications like On-Line Analytical Processing (OLAP), DSS and data mining. Also, we can rebuild our index trees in less than 0.1 seconds for 64M keys on both the CPU and the GPU, enabling fast bulk updates.
We compare CPU search and GPU search and provide analytical models to analyze our optimized search algorithms for each platform and identify the compute and bandwidth requirements. When measured with various tree sizes from 64K elements to 64M elements, CPU search is 2X faster than GPU search for small trees where all elements can fit in the caches, but 1.7X slower than GPU search for large trees where memory bandwidth limits the performance.
¹1M refers to 1 million.
⋆Special Projects Group, Oracle Corporation   ⋄University of California, Santa Cruz
We also evaluate the FAST search algorithm on the Intel Many-Core Architecture Platform (MICA), a Larrabee [29] based silicon platform for many-core research and software development. By exploiting wider SIMD and large caches, search on the MICA platform results in 2.4X – 3X higher throughput than CPU search and 1.8X – 4.4X higher than GPU search.
Search on current GPUs is compute bound and not bandwidth bound. However, CPU search becomes close to memory bandwidth bound as the tree size grows. Compressing the index elements reduces bandwidth consumption, thereby improving CPU search performance. We therefore propose compression techniques for our CPU search that seamlessly handle variable-length string keys and integer keys. Our compression extends the commonly used prefix compression scheme to obtain order-preserving partial keys with a low overhead of false positives. We exploit SSE SIMD execution to speed up compression, compressing 512MB of string keys in less than 0.05 seconds for key sizes up to 100B on the Core i7. We integrate our fast SIMD compression technique into the FAST search framework on CPUs and achieve up to a 6X further speedup for 100B keys compared to uncompressed index search, by easing the memory bandwidth bottleneck.
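To make the partial-key idea concrete, the following is a minimal scalar sketch of order-preserving prefix compression (it is not the SIMD scheme of Section 6): the prefix common to all keys is stripped, the next four bytes are kept as a big-endian integer partial key, and the rare false positives (keys that agree on both the prefix and those four bytes) are resolved against the full keys. The names partial_key, build_pkeys, and search are hypothetical.

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Order-preserving partial keys (illustrative): drop a prefix shared by all
// keys and keep the next 4 bytes as a big-endian integer. Unsigned integer
// comparison of partial keys then agrees with lexicographic order of the
// originals, except for ties (false positives), which are resolved below.
static uint32_t partial_key(const std::string& s, size_t skip) {
    uint32_t v = 0;
    for (size_t i = 0; i < 4; ++i) {
        uint8_t b = (skip + i < s.size()) ? uint8_t(s[skip + i]) : 0;
        v = (v << 8) | b;
    }
    return v;
}

// Length of the prefix shared by all keys; for a sorted list, the first and
// last keys bound it.
static size_t common_prefix_len(const std::vector<std::string>& sorted_keys) {
    if (sorted_keys.size() < 2) return 0;
    const std::string& a = sorted_keys.front();
    const std::string& b = sorted_keys.back();
    size_t n = std::min(a.size(), b.size()), i = 0;
    while (i < n && a[i] == b[i]) ++i;
    return i;
}

static std::vector<uint32_t> build_pkeys(const std::vector<std::string>& sorted_keys,
                                         size_t skip) {
    std::vector<uint32_t> p;
    p.reserve(sorted_keys.size());
    for (const auto& k : sorted_keys) p.push_back(partial_key(k, skip));
    return p;
}

// Exact-match search on the compressed keys, falling back to the full keys
// only to resolve partial-key ties.
static int search(const std::vector<std::string>& sorted_keys,
                  const std::vector<uint32_t>& pkeys, size_t skip,
                  const std::string& q) {
    uint32_t pq = partial_key(q, skip);
    size_t i = std::lower_bound(pkeys.begin(), pkeys.end(), pq) - pkeys.begin();
    for (; i < pkeys.size() && pkeys[i] == pq; ++i)
        if (sorted_keys[i] == q) return int(i);
    return -1;                                 // not found
}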
Finally, we examine four different search techniques – FAST, FAST with compression, a buffered scheme, and a sort-based scheme – under response time constraints. For response times of 0.001 to 25 ms, FAST on compressed keys provides the maximum throughput of around 60M queries per second, while the sort-based scheme achieves higher throughput for response times greater than 30 ms. Our architecture sensitive tree search with efficient compression support lends itself well to exploiting the future trends of a decreasing bandwidth-to-compute ratio and increasing compute resources with more cores and wider SIMD.
with the same probability. Instead of prefetching all (or a subset of) children speculatively, our scheme waits until we know the correct child to move to, and only prefetches that child. To increase the prefetch distance (i.e., the number of cycles between a prefetch initiation and the use of the prefetched data), we use software pipelining [3]. Note that we do not waste main memory bandwidth on unnecessary speculative prefetches – this is important because memory bandwidth is a critical resource.
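As a rough sketch of this idea (assuming a simple pointer-based binary tree; the Node layout, the group size G, and the loop structure are illustrative, not FAST's actual pipeline), the code below interleaves a group of independent queries so that the correct child of one query is prefetched while the processor works on the remaining queries:

#include <xmmintrin.h>   // _mm_prefetch
#include <algorithm>
#include <cstddef>
#include <vector>

struct Node { int key; const Node* child[2]; };   // hypothetical node layout

// Software-pipelined traversal of G independent queries over the same tree.
// Only the child that will actually be visited is prefetched, so no memory
// bandwidth is spent on speculative fetches of its siblings.
template <int G = 8>
void search_batch(const Node* root, const std::vector<int>& keys,
                  std::vector<const Node*>& results) {
    results.assign(keys.size(), nullptr);
    for (size_t base = 0; base < keys.size(); base += G) {
        size_t n = std::min<size_t>(G, keys.size() - base);
        const Node* cur[G];
        for (size_t i = 0; i < n; ++i) cur[i] = root;

        bool active = true;
        while (active) {
            active = false;
            // One tree level per round-robin pass: by the time we return to
            // query i, its prefetched child should already be in the cache.
            for (size_t i = 0; i < n; ++i) {
                if (!cur[i]) continue;
                const Node* next = cur[i]->child[keys[base + i] > cur[i]->key];
                if (next) {
                    _mm_prefetch(reinterpret_cast<const char*>(next), _MM_HINT_T0);
                    active = true;
                } else {
                    results[base + i] = cur[i];   // reached the bottom of the tree
                }
                cur[i] = next;
            }
        }
    }
}

The prefetch distance here is the work done on the other G - 1 queries between issuing the prefetch and dereferencing the prefetched node on the next pass.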
Another way of reducing TLB/cache miss overheads is to amortize the penalties by processing data in batches [4, 32]. Buffers are attached to nodes or sub-trees and queries are processed in batch. These batched search schemes essentially sacrifice response time to achieve high throughput and can only be used for applications where a large number of query requests arrive in a short amount of time, such as stream processing or indexed nested loop join.
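The sketch below captures only the core effect of such batching (it is not the buffer management of [4, 32]): an entire batch of queries advances through the tree one level at a time, so the few nodes of the upper levels are reused by many queries before their cache lines are evicted. The implicit-array layout and the name batched_search are assumptions for illustration.

#include <cstddef>
#include <vector>

// Level-synchronous batch traversal over an implicit binary tree stored
// breadth-first in an array (node i has children 2i+1 and 2i+2). `levels`
// must not exceed the depth of the tree; on return, pos[q] is the implicit
// position reached by query q after `levels` comparisons.
void batched_search(const std::vector<int>& tree,
                    const std::vector<int>& keys,
                    std::vector<size_t>& pos,          // per-query node index
                    size_t levels) {
    pos.assign(keys.size(), 0);                        // every query starts at the root
    for (size_t l = 0; l < levels; ++l) {
        for (size_t q = 0; q < keys.size(); ++q) {     // whole batch per level
            size_t i = pos[q];
            pos[q] = (keys[q] > tree[i]) ? 2 * i + 2 : 2 * i + 1;
        }
    }
}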
Modern processors exploit SIMD execution to increase compute density. Researchers have used SIMD instructions to improve search performance in tree-structured indexes [28, 31]. Zhou et al. [31] employ SIMD instructions to improve binary search performance. Instead of comparing a search key with one tree element, SIMD instructions allow K (= SIMD width) consecutive elements to be compared simultaneously. However, their SIMD comparison in a node only splits the search into two ranges, just as in binary search. Overall, the total computation complexity is still log(N/K). Recently, P-ary and K-ary search have been proposed to exploit SIMD execution on GPUs [21] and CPUs [28]. Using the GPU’s memory gather capability, P-ary search accelerates search on sorted lists. The SIMD comparison splits the tree into (K + 1) ranges, thus reducing the total number of comparison steps to logK(N). To avoid non-contiguous memory accesses at each level, they linearize the K-ary tree in increasing order of levels by rearranging elements. However, when the tree is large, traversing the bottom part of the tree incurs TLB/cache misses at each level and search becomes latency bound. We explore latency hiding techniques for CPUs and GPUs to improve instruction throughput, resulting in better SIMD utilization.
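As an illustration of the K-ary comparison (a minimal SSE sketch with K = 4 and signed 32-bit keys; not the actual code of [28] or of FAST, and child_index is a hypothetical name), one SIMD comparison against a node's four sorted keys yields which of the five child ranges the query falls into:

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

// The node stores 4 sorted keys; one 4-wide comparison decides among the
// 5 child ranges. Because the keys are sorted, the comparison mask is a
// contiguous run of set bits, so its population count is the child index.
static inline int child_index(const int32_t node_keys[4], int32_t query) {
    __m128i keys = _mm_loadu_si128(reinterpret_cast<const __m128i*>(node_keys));
    __m128i q    = _mm_set1_epi32(query);
    __m128i gt   = _mm_cmpgt_epi32(q, keys);             // lanes where query > key
    int mask = _mm_movemask_ps(_mm_castsi128_ps(gt));    // 4-bit lane mask
    return __builtin_popcount(mask);                     // 0..4 (GCC/Clang builtin)
}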
Another architectural trend affecting search is that main memory bandwidth is becoming a critical resource that can limit performance scaling on future processors. Traditionally, the problem has been disk I/O bandwidth. Compression techniques have been used to overcome the disk I/O bottleneck by increasing the effective memory capacity [15, 17, 20]. The transfer unit between memory and processor cores is a cache line. Compression allows each cache line to pack more data and increases the effective memory bandwidth. This increased memory bandwidth can improve query processing speed as long as decompression overhead is kept minimal [19, 33]. Recently, SIMD execution has been applied to further reduce decompression overhead in the context of scan operations on compressed data [30].
While there is much research on handling numerical data in index trees [9, 18, 21, 25, 26, 28], there are relatively few studies on handling variable-length keys [5, 7, 8]. Compression techniques can be used to shrink longer keys into smaller keys. For tree-structured indexes, compressing keys increases the fanout of the tree and decreases the tree height, thus improving search performance by reducing cache misses. Binnig et al. [7] apply the idea of buffered search [32] to handle variable-size keys in a cache-friendly manner when response time is of less concern than throughput and lookups can be handled in bulk.
3. ARCHITECTURE OPTIMIZATIONS
Efficient utilization of compute resources depends on how to extract instruction-level parallelism (ILP), thread-level parallelism
2. RELATED WORK
B+-trees [13] were designed to accelerate searches on disk-based database systems. As main memory sizes become large enough to store databases, T-trees [23] have been proposed as a replacement, specifically tuned as a main memory index structure. While T-trees have less storage overhead than B+-trees, Rao et al. [25] showed that B+-trees actually have better cache behavior on modern processors because the data in a single cache line is utilized more efficiently and used in more comparisons. They proposed the CSS-tree, where each node has a size equal to the cache line size and stores no child pointers. Later, they applied cache consciousness to B+-trees and proposed the CSB+-tree [26] to support efficient updates. Graefe et al. [16] summarized techniques for improving cache performance of B-tree indexes.
Hankins et al. [18] explored the effect of node size on the performance of CSB+-trees and found that using node sizes larger than a cache line size (i.e., larger than 512 bytes) produces better search performance. While trees with nodes the same size as a cache line have the minimum number of cache misses, they found that TLB misses are much higher than on trees with large node sizes, thus favoring large node sizes. Chen et al. [9] also concluded that having a B+-tree node size larger than a cache line performs better and proposed pB+-trees, which try to minimize the increase in cache misses from larger nodes by inserting software prefetches. Later, they extended pB+-trees (called “fpB+-trees”) to optimize for both disk I/O and CPU caches for disk-based DBMSs [10].
While using prefetches in pB+-trees reduces cache misses in the search within a node, all published tree-structured indexes suffer from a full cache miss latency when moving from each node to its child. Chen et al. argue that prefetching the children is not efficient because the tree node fanout is large and each child will be visited
(TLP), and data-level parallelism (DLP) while managing usage of external memory bandwidth judiciously.
3.1 Memory Latency
Most CPUs and GPUs are capable of issuing one or more instructions per cycle. The latency of memory instructions can prevent execution of other instructions. Every memory access must go through a virtual-to-physical address translation, which is on the critical path of program execution. To improve translation speed, a translation lookaside buffer (TLB) is used to cache translations of the most frequently accessed pages. If the translation is not found in the TLB, the processor pipeline stalls until the TLB miss is served. Both last-level cache (LLC) and TLB misses are difficult to hide because the miss penalty reaches more than a few hundred cycles. If the miss penalty cannot be hidden, a processor cannot fully utilize its compute resources and applications become memory latency bound. At this point, exploiting SIMD vector units is ineffective.
One way to reduce memory access latency is prefetching. However, hardware prefetchers are not effective for irregular memory accesses like tree traversal. Software prefetch instructions are also hard to insert for tree-structured indexes. Prefetching tree nodes close to the current node results in a very short prefetch distance, and most prefetches will be too late to be effective. Tree nodes far down from the current node cover a large fan-out, and prefetching all of them wastes memory bandwidth significantly since only one of the prefetches will be useful.
3.2 TLP/DLP
Once the memory latency impact is minimized, we can exploit the high-density compute resources in modern processors, which integrate many cores, each with wide vector (SIMD) units. Parallelization and vectorization are two key techniques to exploit compute resources. Parallelization in search can be done trivially by assigning threads to different queries. Each core executes different search queries independently.
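A minimal sketch of this parallelization, assuming a std::thread-based runtime (search_one stands in for any single-query search routine and is shown here as a plain binary search; the names are hypothetical):

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder single-query search: plain lower-bound over sorted keys.
static int search_one(const std::vector<int>& sorted_keys, int key) {
    auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), key);
    return (it != sorted_keys.end() && *it == key)
               ? int(it - sorted_keys.begin()) : -1;
}

// Thread-level parallelism: the query stream is split into contiguous chunks
// and each hardware thread searches its chunk independently.
void parallel_search(const std::vector<int>& sorted_keys,
                     const std::vector<int>& queries,
                     std::vector<int>& results) {
    results.resize(queries.size());
    unsigned t = std::max(1u, std::thread::hardware_concurrency());
    size_t chunk = (queries.size() + t - 1) / t;
    std::vector<std::thread> workers;
    for (unsigned w = 0; w < t; ++w) {
        size_t begin = w * chunk;
        size_t end   = std::min(queries.size(), begin + chunk);
        if (begin >= end) break;
        workers.emplace_back([&, begin, end] {
            for (size_t i = begin; i < end; ++i)
                results[i] = search_one(sorted_keys, queries[i]);
        });
    }
    for (auto& th : workers) th.join();
}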
As for exploiting SIMD, there are multiple approaches to performing search. We can use a SIMD unit to speed up a single query, assign different queries to each SIMD lane and execute multiple queries in parallel, or combine these two approaches by assigning a query to a subset of SIMD lanes. The best approach depends on the underlying hardware architecture, such as the SIMD width and the efficiency of gathering and scattering² data. In Section 5, we describe the best SIMD search mechanism for both CPUs and GPUs.
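For the query-per-lane option, gather efficiency matters because each lane must fetch a different element. The sketch below is illustrative only: it assumes AVX2 gathers (which the Core i7 used in this paper does not provide) and runs eight lower-bound binary searches over a sorted array in lockstep, one query per lane; search8 is a hypothetical name.

#include <immintrin.h>   // AVX2
#include <cstddef>
#include <cstdint>
#include <vector>

// Eight independent binary searches proceed in lockstep; each round gathers
// one probe element per lane and updates that lane's [lo, hi) range.
void search8(const std::vector<int32_t>& sorted, const int32_t queries[8],
             int32_t lower_bound_out[8]) {
    __m256i lo  = _mm256_setzero_si256();
    __m256i hi  = _mm256_set1_epi32(int32_t(sorted.size()));
    __m256i cap = _mm256_set1_epi32(int32_t(sorted.size()) - 1);
    __m256i q   = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(queries));
    for (size_t n = sorted.size(); n > 0; n >>= 1) {         // ~log2(N) rounds
        __m256i mid = _mm256_srli_epi32(_mm256_add_epi32(lo, hi), 1);
        mid = _mm256_min_epi32(mid, cap);                    // keep gather in bounds
        __m256i vals = _mm256_i32gather_epi32(sorted.data(), mid, 4);
        __m256i go_right = _mm256_cmpgt_epi32(q, vals);      // lanes with q > vals
        // lo = mid + 1 where q > vals; hi = mid elsewhere
        lo = _mm256_blendv_epi8(lo, _mm256_add_epi32(mid, _mm256_set1_epi32(1)),
                                go_right);
        hi = _mm256_blendv_epi8(mid, hi, go_right);
    }
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(lower_bound_out), lo);
}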
3.3 Memory Bandwidth
With the increased compute capability, the demand for data also increases proportionally. However, main memory bandwidth is growing at a lower rate than compute [27]. Therefore, performance will not scale with the number of cores and SIMD width if applications become bandwidth bound. To bridge the enormous gap between bandwidth requirements and what the memory system can provide, most processors are equipped with several levels of on-chip memory storage (e.g., caches on CPUs and shared buffers on GPUs). If the data structures being accessed fit in this storage, no memory bandwidth is utilized, thus amplifying the effective memory bandwidth.
However, if data structures are too big to fit in caches, we should ensure that a cache line brought from memory is fully utilized before being evicted from the cache (called “cache line blocking”). A cache line with a typical size of 64 bytes can pack multiple data elements (e.g., 16 elements of 4-byte integers). The cache line blocking
²In this paper, we use the term “gather/scatter” to represent reads/writes from/to non-contiguous memory locations.
technique basically rearranges data elements so that the subsequent elements to be used also reside within the same cache line.
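A minimal sketch of cache line blocking for a tree built over a sorted array (illustrative only; FAST's exact hierarchical blocking is described in Section 4): the top d levels of every subtree are written out contiguously, so with 4-byte keys and d = 4 the 15 keys a traversal touches next share one 64-byte cache line. The function name block_subtree is hypothetical.

#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Emit the subtree over sorted[lo, hi) in blocks of d levels: the 2^d - 1
// node keys of the top d levels are appended contiguously (breadth first,
// padded where the subtree runs out of keys), then each of the 2^d child
// subtrees is laid out the same way. Aligning each block to a cache line
// boundary (one extra padding slot for d = 4) is omitted for brevity.
void block_subtree(const std::vector<int32_t>& sorted, size_t lo, size_t hi,
                   int d, std::vector<int32_t>& out) {
    if (lo >= hi) return;
    std::vector<std::pair<size_t, size_t>> level = {{lo, hi}}, next;
    for (int l = 0; l < d; ++l) {
        next.clear();
        for (auto [a, b] : level) {
            if (a >= b) {                          // empty slot: pad the block
                out.push_back(INT32_MAX);
                next.push_back({a, b});
                next.push_back({a, b});
                continue;
            }
            size_t mid = a + (b - a) / 2;
            out.push_back(sorted[mid]);            // node key of this block
            next.push_back({a, mid});              // left child sub-range
            next.push_back({mid + 1, b});          // right child sub-range
        }
        level.swap(next);
    }
    for (auto [a, b] : level)                      // one block per child subtree
        block_subtree(sorted, a, b, d, out);
}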
Finally, data compression techniques can be used to pack more data elements into a cache line and prevent performance from becoming bandwidth bound. In Section 6, we show how to integrate data compression into our index tree framework to improve performance, in addition to gaining more effective memory space.
4. ARCHITECTURE SENSITIVE TREE
We motivate and describe our index tree layout scheme. We also provide analytical models highlighting the compute and memory bandwidth requirements for traversing the resultant trees.
4.1 Motivation
Given a list of (key, rid) tuples sorted by the keys, a typical query involves searching for tuples containing a specific key (keyq) or a range of keys ([keyq1, keyq2]). Tree index structures are built using the underlying keys to facilitate fast search operations – with run-time proportional to the depth of the trees. Typically, these trees are laid out in a breadth-first fashion, starting from the root of the tree. The search algorithm involves comparing the search key to the key stored at a specific node at every level of the tree, and traversing a child node based on the comparison result. Only one node at each level is actually accessed, resulting in ineffective cache line utilization due to the linear storage of the tree. Furthermore, as we traverse deeper into the tree, each access results in accessing elements stored in different pages of memory, thereby incurring TLB misses. Since the result of the comparison is required before loading the appropriate child node, cache line prefetches cannot be issued beforehand. On modern processors, a search operation typically involves a long-latency TLB/cache miss followed by a small number of arithmetic operations at each level of the tree, leading to ineffective utilization of processor resources.
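For reference, the conventional breadth-first layout described above can be written as an implicit array in which node i has children 2i+1 and 2i+2; the search touches exactly one node per level, and for large trees consecutive probes land on different cache lines and pages. The code below is a generic illustration, not taken from the paper.

#include <cstddef>
#include <cstdint>
#include <vector>

// Baseline search over a complete binary search tree stored breadth-first.
// Returns the array index of the matching node, or -1 if the key is absent.
int64_t bfs_tree_search(const std::vector<int32_t>& tree, int32_t key_q) {
    size_t i = 0;
    while (i < tree.size()) {
        if (tree[i] == key_q) return int64_t(i);           // found
        i = (key_q < tree[i]) ? 2 * i + 1 : 2 * i + 2;     // descend one level
    }
    return -1;
}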
Although blocking for disk/memory page size has been proposed in the past [13], the resultant trees may reduce the TLB miss latency, but do not necessarily optimize for effective cache line utilization, leading to higher bandwidth requirements. Cache-line-wide nodes [25] minimize the number of accessed cache lines, but cannot utilize SIMD instructions effectively. Recently, 3-ary trees [28] were proposed to exploit the 4-element-wide SIMD of CPUs. They rearranged the tree nodes in order to avoid expensive gather/scatter operations. However, their tree structure does not naturally incorporate cache line/page blocking and their performance suffers for tree sizes larger than the last-level cache (LLC). In order to efficiently use the compute performance of processors, it is imperative to eliminate latency stalls and to store/access trees in a SIMD-friendly fashion to further speed up the run-time.
4.2 Hierarchical Blocking
We advocate building binary trees