代写代考 ISSN 2150-8097. DOI: https://doi.org/10.14778/3389133.3389134 ∗Work partial

Scaling Dynamic Hash Tables on Real Persistent Memory
Baotong Lu1∗ 2 Tianzheng Wang2 1
1The Chinese University of {btlu,
2 University {xha62,

Copyright By PowCoder代写 加微信 powcoder

Byte-addressable persistent memory (PM) brings hash tables the potential of low latency, cheap persistence and instant recovery. The recent advent of Intel Optane DC Persistent Memory Modules (DCPMM) further accelerates this trend. Many new hash table designs have been proposed, but most of them were based on emulation and perform sub-optimally on real PM. They were also piecewise and partial solutions that side-stepped many important properties, in particular good scalability, high load factor and instant recovery.
We present Dash, a holistic approach to building dynamic and scalable hash tables on real PM hardware with all the aforementioned properties. Based on Dash, we adapted two popular dynamic hashing schemes (extendible hashing and linear hashing). On a 24-core server with Optane DCPMM, compared to state-of-the-art, Dash can achieve up to ∼3.9× higher performance with up to over 90% load factor and an instant recovery time of 57ms regardless of data size.
1. INTRODUCTION
Dynamic hash tables that can grow and shrink as needed at runtime are a fundamental building block of many data- intensive systems, such as database systems [11, 14] and key-value stores [3,15]. Persistent memory (PM) such as 3D XPoint [1] promises byte-addressability, persistence, high ca- pacity, low cost and high performance, all on the memory bus. These features make PM very attractive for building dynamic hash tables that persist and operate directly on PM, with high performance and instant recovery. The recent release of Intel Optane DC Persistent Memory Module (DCPMM) brings this vision closer to reality. Since PM exhibits several distinct properties (e.g., asymmetric read/write speeds and higher latency); blindly applying prior disk or DRAM based approaches [2,7] would not reap its full benefits, necessitating a departure from conventional designs.
⃝c Copyright is held by the owner/author(s). Publication rights licensed to the VLDB Endowment. This is a minor revision of the paper entitled “Dash: Scalable Hashing on Per- sistent Memory”, published in PVLDB, Vol. 13, No. 8, ISSN 2150-8097. DOI: https://doi.org/10.14778/3389133.3389134 ∗Work partially performed while at University.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Copyright 2008 ACM 0001-0782/08/0X00 …$5.00.
Level Hashing
Number of threads
CCEH (ideal)
Level Hashing (ideal)
Number of threads
40 30 20 10
Figure 1: Throughput of state-of-the-art PM hash tables (CCEH [10] and Level Hashing [22]) for insert (left) and search (right) operations on Optane DCPMM. Neither matches the
expected scalability.
1.1 Hashing on PM: Not What You Assumed!
There have been a new breed of hash tables specifically designed for PM [10, 17, 22] based on DRAM emulation, before actual PM was available. Their main focus is to reduce cacheline flushes and PM writes for scalable performance. But when they are deployed on real Optane DCPMM, we find (1) scalability is still a major issue, and (2) desirable properties are often traded off.
Figure 1 shows the throughput of two state-of-the-art PM hash tables [10,22] under insert (left) and search (right) op- erations, on a 24-core server with Optane DCPMM running workloads under uniform key distribution (details in Sec- tion 6). As core count increases, neither scheme scales for inserts, or even read-only search operations. Corroborating with recent work [8,20], we find the main culprit is Optane DCPMM’s limited bandwidth, which is ∼3–14× lower than DRAM’s. Although the server is fully populated to provide the maximum possible bandwidth, excessive PM accesses can still easily saturate and prevent the system from scaling. We describe two main sources of excessive PM accesses that were not given enough attention before, followed by a discussion of important but missing functionality in prior work.
Excessive PM Reads. Reducing PM writes has been a main theme in recent work, but many existing solutions incur more PM reads. We note that it is also imperative to reduce PM reads. Different from the device-level behavior (PM reads being faster than writes), end-to-end write latency (i.e., the entire data path including CPU caches and write buffers in the memory controller) is often lower than reads [20]. The reason is while PM writes can leverage write buffers, PM reads mostly touch the PM media due to hash table’s inherent random access patterns. In particular, existence checks in record probing constitute a large proportion of such PM reads: to find out if a key exists, multiple buckets may
SIGMOD Record, March 2021 (Vol. 50, No. 1)
Million ops/s
Million ops/s

have to be searched, incurring many cache misses and PM reads when comparing keys.
Heavyweight Concurrency Control. Most prior work side-stepped the impact of concurrency control. Bucket- level locking has been widely used [10,22], but it incurs additional PM writes to acquire/release read locks, further pushing bandwidth consumption towards the limit. Lock- free designs [17] can avoid PM writes for read-only probing operations, but are notoriously hard to get right, more so on PM when safe persistence is necessary [19].
Neither record probing nor concurrency control typically prevents a well-designed hash table to scale on DRAM. How- ever, on PM they can easily exhaust PM’s limited bandwidth. These issues call for new designs that can reduce unneces- sary PM reads during probing and lightweight concurrency control that further reduces PM writes.
Missing Functionality. We observe in prior designs, necessary functionality was often traded off for performance (though scalability is still an issue on real PM). (1) Indexes could occupy more than 50% of memory capacity [21], so it is critical to improve load factor (records stored vs. hash table capacity). Yet high load factor is often sacrificed by organizing buckets using larger segments in exchange for smaller directories (fewer cache misses) [10]. As we describe later, this in turn can trigger more pre-mature splits and
incur even more PM accesses, impacting performance. (2) Variable-length keys are widely used in reality, but prior approaches rarely discuss how to efficiently support them. (3) Instant recovery is a unique, desirable feature that could be provided by PM, but is often omitted in prior work which requires a linear scan of the metadata whose size scales with data size. (4) Prior designs also often side-stepped the PM programming issues (e.g., PM allocation), which impact the proposed solution’s scalability and adoption in reality.
We present Dash, a holistic approach to dynamic and scalable hashing on real PM without trading off desirable properties. Dash uses a combination of new and existing techniques that are carefully engineered to achieve this goal.
1 We adopt fingerprinting [12] that was used in PM tree structures to avoid unnecessary PM reads during record prob- ing. The idea is to generate fingerprints (one-byte hashes) of keys and place them compactly to summarize the possible existence of keys. This allows a thread to tell if a key possibly exists by scanning the fingerprints which are much smaller than the actual keys. 2 Instead of traditional bucket-level locking, Dash uses an optimistic, lightweight flavor of it that uses verification to detect conflicts, rather than (expensive) shared locks. This allows Dash to avoid PM writes for search operations. With fingerprinting and optimistic concurrency, Dash avoids both unnecessary reads and writes, saving PM bandwidth and allowing Dash to scale well. 3 Dash re- tains desirable properties. We propose a new load balancing strategy to postpone segment splits with improved space uti- lization. To support instant recovery, we design Dash to only perform a constant amount of work upon restart (reading and possibly writing a one-byte counter) and amortize the
“real” recovery work to runtime. Compared to prior work that handles PM programming issues in ad hoc ways, Dash uses well-defined PM programming models (PMDK [4], one of the most popular PM libraries) to systematically handle crash consistency, PM allocation and achieve instant recovery.
Although these techniques are not all new, Dash is the first to integrate them for building hash tables that scale without sacrificing features on real PM. Techniques in Dash can be applied to various static and dynamic hashing schemes. In this paper, we focus on dynamic hashing and apply Dash to two classic and widely-used approaches: extendible hash- ing [2,10] and linear hashing [7]. Evaluation using a 24-core Intel Xeon Scalable CPU and 768GB of Optane DCPMM shows that compared to existing state-of-the-art [10, 22], Dash achieves up to ∼3.9× better performance on realistic workloads, up to over 90% of load factor with high space utilization and the ability to instantly recover in 57ms re- gardless of data size. Our implementation is open-source at: https://github.com/baotonglu/dash.
2. BACKGROUND
We first give necessary background on PM hardware and dynamic hashing, then discuss issues in prior PM hash tables.
2.1 Intel Optane DC Persistent Memory
Hardware. We target Optane DCPMMs (in DIMM form factor). In addition to byte-addressability and persistence, DCPMM offers high capacity at a price lower than DRAM’s. Similar to other work [10,22], we leverage its AppDirect mode, as it provides more flexibility and persistence guarantees.
System Architecture. Current mainstream CPU archi- tectures (e.g., Intel Cascade Lake) place DRAM and PM behind multiple levels of volatile CPU caches. Data is not guaranteed to be persisted in PM until a cacheline flush instruction (e.g., CLWB [5]) is executed or other events that implicitly cause cacheline flush occur. Writes to PM may be reordered, requiring fences to avoid undesirable reordering. The application (hash tables in our case) must issue fences and cacheline flushes to ensure correctness. After a cacheline of data is flushed, it will reach the asynchronous DRAM refresh (ADR) domain which includes a write buffer and a write pending queue with persistence guarantees upon power failure. Once data is in the ADR domain (not necessarily the DCPMM media), it is considered persistent.
Future CPU and DCPMM generations (e.g., the upcom- ing Intel Ice Lake CPUs) may feature extended ADR (eADR) which further includes the CPU cache in the ADR domain [16], effectively providing persistent CPU caches and thus elimi- nating the need for cacheline flushes (fences are still needed). The current implementation of Dash still issues cacheline flushes, however, our preliminary experiments that skip cache- line flushes on existing Cascade Lake CPUs showed eADR’s potential impact is very small for hash tables. We attribute the reason to hash table’s inherently random access patterns.
Performance Characteristics. At the device level, as many previous studies have shown, PM exhibits asymmetric read and write latency, with writes being slower. It exhibits ∼300ns read latency, ∼4× longer than DRAM’s. More recent studies [20], however revealed that on Optane DCPMM, read latency as seen by the software is often higher than write latency. This is attributed to the fact that writes (store instructions) commit (also ensure persistence) once the data reaches the ADR domain at the memory controller rather than when reaching the PM media. On the contrary, a read operation often needs to touch the actual media unless the data being accessed is cache-resident (which is rare in data structures with inherent randomness, e.g., hash tables). Tests also showed that the bandwidth of DCPMM depends
SIGMOD Record, March 2021 (Vol. 50, No. 1)

002 012 102 112 0002 0012 0102 0112 1002 1012 1102 1112
(b) After inserting 30. Global depth becomes 3 and directory doubled.
Directory: Buckets:
Localdepth: 2 2 2 2
(a) Before inserting 30. Global depth is 2.
An example of extendible hashing.
as a result of inserts, and eventually the overflowed bucket will be split and have its keys redistributed. If a bucket is full and an insert is requested to it, more overflow buckets will be created and chained together with the original, full bucket. Compared with extendible hashing, linear hashing could have smaller directory size by proper organization [7].
2.3 Dynamic Hashing on PM
To reduce PM accesses on dynamic extendible hashing, CCEH [10] groups buckets into larger segments. Each di- rectory entry then points to a segment which consists of a fixed number of buckets. This design reduces the size of the directory, making it more likely to be cached entirely by the CPU, which helps reducing access to PM. Note that split now happens at the segment (instead of bucket) level. A segment is split once any bucket in it is full, even if the other buckets in the segment still have free slots, which results in low load factor and more PM accesses. To reduce such pre- mature splits, linear probing can be used to allow a record to be inserted into a neighbor bucket. However, this improves load factor at the cost of more PM accesses. Thus, most approaches bound probing distance to a fixed number, e.g., CCEH probes no more than four cachelines. However, our evaluation (Section 6) shows that linear probing alone is not enough in achieving high load factor.
Another important aspect of dynamic PM hashing is to ensure failure atomicity, particularly during segment split which involves lots of PM writes. Existing approaches such as CCEH side-step PM management issues surrounding segment split, having the risk of permanent PM leaks.
3. DESIGN PRINCIPLES
The discussions in Section 2 lead to the following design principles of Dash:
• Avoid both Unnecessary PM Reads and Writes. Probing performance impacts not only search operations, but also all the other operations. Therefore, in addition to reducing PM writes, Dash must avoid unnecessary PM reads to conserve bandwidth and alleviate the impact of high end-to-end read latency.
• Lightweight Concurrency. Dash must scale well on multicore machines with persistence guarantees. Given the limited bandwidth, concurrency control must be lightweight to incur not much overhead (i.e., avoid PM writes for search operations, such as read locks). Ideally, it should also be relatively easy to implement.
• Full Functionality. Dash must not sacrifice or trade off important features that make a hash table useful in practice. In particular, it needs to support instant recovery and variable-length keys and achieve high space utilization.
on many factors of the workload. In general, compared to DRAM, it exhibits ∼3×/∼8× slower sequential/random read bandwidth. The numbers for sequential/random write are ∼11×/∼14×. Notably, DCPMM exhibits very limited per- formance for small, random accesses [20], which are inherent access pattern for hash tables. These properties exhibit a stark contrast to prior estimates [13,18], and lead to signifi- cantly lower performance of many prior designs on DCPMM than originally reported. Thus, it is important to reduce both PM reads and writes for higher performance.
2.2 Dynamic Hashing
Now we give an overview of extendible hashing [2] and lin- ear hashing [7]. We focus on their memory-friendly versions which PM-adapted hash tables were based upon.
Extendible Hashing. The crux of extendible hashing is its use of a directory to index buckets so that they can be added and removed dynamically at runtime. When a bucket is full, it is split into two new buckets with keys redistributed. The directory may get expanded (doubled) if there is not enough space to store pointers to the new bucket. Figure 2(a) shows an example with four buckets, each of which is pointed to by a directory entry. In the figure, indices of directory entries are shown in binary. The two least significant bits
(LSBs) of the hash value are used to select a bucket; we call the number of suffix bits being used here the global depth. The hash table can have at most 2global depth directory entries (buckets). A search operation follows the pointer in the corresponding directory entry to probe the bucket. Each bucket also has a local depth. The number of directory entries pointing to one bucket is 2global depth−local depth. This allows us to determine whether a directory doubling is needed: if a bucket whose local depth equals the global depth is split (e.g., bucket 012 in Figure 2(a)), then the directory needs to be doubled to accommodate the new bucket. Figure 2 shows an example of splitting the full bucket 012 when an inserting key 30 is hashed into that bucket. After bucket splitting, local depth of the splitting bucket needs to be properly updated. Choosing a proper hash function that evenly distributes keys to all buckets is an important but orthogonal problem.
Linear Hashing. In-memory linear hashing takes a sim- ilar approach to organizing buckets using a directory with entries pointing to individual buckets [7]. The main differ- ence compared to extendible hashing is that in linear hashing, the bucket to be split is chosen “linearly.” That is, it keeps a pointer (page ID or address) to the bucket to be split next and only that bucket would be split in each round, and advances the pointer to the next bucket when the split of the current bucket is finished. Therefore, the bucket being split is not necessarily the same as the bucket that is full
Dash FOR EXTENDIBLE HASHING
SIGMOD Record, March 2021 (Vol. 50, No. 1)
Based on the principles in Section 3, we describe Dash in the context of Dash-Extendible Hashing (Dash-EH). We discuss how Dash applies to linear hashing in Section 5.
4.1 Overview
Similar to prior approaches [7,10], Dash-EH uses segmen- tation. As shown in Figure 3, each directory entry points to a segment which consists of a fixed number of normal buckets and stash buckets for overflow records from normal buckets

Segment 0:
. . . Bucket Bucket Bucket Bucket . . .
b-1 b b+1 b+2
Segment n:
1. Balanced insert 3. Stash
2. Displace
. . . Bucket Bucket Bucket Bucket . . .
b-1 b b+1 b+2
Stash buckets
Stash buckets
224 bytes (16-byte x 14 pairs)
Records (key-value pairs)
Version lock (4 bytes)
FP 1 FP 2 . . . 14 + 4 fingerprints
Overflow fingerprint bitmap
Membership
Alloc. bitmap
Overflow bit
Stash bucket index
Overflow membership
Overflow count
Figure 3: Overall architecture of Dash-EH.
which did not have enough space for the inserts. The lock, version number and clean marker are for concurrency control and recovery, which we describe later.
Figure 4 shows the internals of a bucket. We place the metadata used for bucket probing on the first 32 bytes, followed by multiple 16-byte record slots. The first 8 bytes in each slot store the key (or a pointer to it for keys longer than 8 bytes). The remaining 8 bytes store the payload which is opaque to Dash; it can be an inlined value or a pointer, depending on the application’s need. The size of a bucket is adjustable. In our current implementation it is set to 256-byte, which allows us to store 14 records per bucket.
The 32-byte metadata includes key data structures for Dash-EH to handle hash table operations and realize the design princi

程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com