
When Poll is Better than Interrupt
Jisoo Yang    Dave B. Minturn    Frank Hady
{jisoo.yang | dave.b.minturn | frank.hady} (at) intel.com
Intel Corporation


In a traditional block I/O path, the operating system completes virtually all I/Os asynchronously via interrupts. However, when performing storage I/O with ultra-low-latency devices built on next-generation non-volatile memory, polling for the completion – and hence spending clock cycles during the I/O – delivers higher performance than traditional interrupt-driven I/O. This paper thus argues for the synchronous completion of block I/O: first by presenting strong empirical evidence of a stack latency advantage, second by delineating the limits of the current interrupt-driven path, and third by showing that synchronous completion is indeed safe and correct. This paper further discusses the challenges and opportunities the synchronous I/O completion model introduces for both operating system kernels and user applications.
1 Introduction
When an operating system kernel processes a block storage I/O request, the kernel usually submits and completes the I/O request asynchronously, releasing the CPU to perform other tasks while the hardware device completes the storage operation. In addition to the CPU cycles saved, the asynchrony provides opportunities to reorder and merge multiple I/O requests to better match the characteristics of the backing device and achieve higher performance. Indeed, this asynchronous I/O strategy has worked well for traditional rotating devices and even for NAND-based solid-state drives (SSDs).
Future SSD devices may well utilize high-performance next-generation non-volatile memory (NVM), calling for a re-examination of the traditional asynchronous completion model. The high performance of such devices both diminishes the CPU cycles saved by asynchrony and reduces the I/O scheduling advantage.
This paper thus argues for the synchronous I/O completion model, by which the kernel path handling an I/O request stays within the process context that initiated the I/O. Synchronous completion allows I/O requests to bypass the kernel’s heavyweight asynchronous block I/O subsystem, reducing the CPU clock cycles needed to process I/Os. However, a necessary condition is that the CPU spin-wait for the completion from the device, increasing the cycles used.
Using a prototype DRAM-based storage device to mimic the potential performance of a very fast next-generation SSD, we verified that the synchronous model completes an individual I/O faster and consumes fewer CPU clock cycles despite having to poll. The device is fast enough that the spinning time is smaller than the overhead of the asynchronous I/O completion model.
Interrupt-driven asynchronous completion introduces additional performance issues when used with very fast SSDs such as our prototype. Asynchronous completion may suffer from lower I/O rates even when scaled to many outstanding I/Os across many threads. We empirically confirmed this with Linux,* and examine the system overheads of interrupt handling, cache pollution, and CPU power-state transitions associated with the asynchronous model.
We also demonstrate that the synchronous completion model is correct and simple with respect to maintaining I/O ordering when used with application interfaces such as non-blocking I/O and multithreading.
We suggest that current applications may further benefit from the synchronous model by avoiding the non-blocking storage I/O interface and by reassessing buffering strategies such as I/O prefetching. We conclude that with future SSDs built of next-generation NVM elements, introducing the synchronous completion model could reap significant performance benefits.
2 Background
The commercial success of SSDs coupled with reported advancements in NVM technology is significantly reducing the performance gap between mass storage and memory [15]. Experimental storage devices that complete an I/O within a few microseconds have been demonstrated [8]. One of the implications of this trend is that

the once negligible cost of I/O stack time becomes more relevant [8,12]. Another important trend in operating with SSDs is that big, sequential, batched I/O requests need no longer be favored over small, random I/O requests [17].
In the traditional block I/O architecture, the operating system’s block I/O subsystem performs the task of scheduling I/O requests and forwarding them to block device drivers. This subsystem processes kernel I/O requests specifying the starting disk sector, target memory address, and size of the I/O transfer, and originating from a file system, page cache, or user application using direct I/O. The block I/O subsystem schedules kernel I/O requests by queueing them in a kernel I/O queue and placing the I/O-issuing thread in an I/O wait state. The queued requests are later forwarded to a low-level block device driver, which translates the requests into device I/O commands specific to the backing storage device.
Upon finishing an I/O command, a storage device is expected to raise a hardware interrupt to inform the device driver of the completion of a previously submitted command. The device driver’s interrupt service routine then notifies the block I/O subsystem, which subsequently ends the kernel I/O request by releasing the target memory and un-blocking the thread waiting on the completion of the request. A storage device may handle multiple device commands concurrently using its own device queue [2,5,6], and may combine multiple completion interrupts, a technique called interrupt coalescing, to reduce overhead.
As described, the traditional block I/O subsystem uses asynchrony within the I/O path to save CPU cycles for other tasks while the storage device handles I/O commands. Also, using I/O schedulers, the kernel can reorder or combine multiple outstanding kernel I/O requests to better utilize the underlying storage media.
This description of the traditional block storage path captures what we will refer to as the asynchronous I/O completion model. In this model, the kernel submits a device I/O command in a context distinct from the context of the process that originated the I/O. The hardware interrupt generated by the device upon command completion is also handled, at first, by a separate kernel context. The original process is later awakened to resume its execution.
A block I/O subsystem typically provides a set of in-kernel interfaces for device driver use. In Linux, a block device driver is expected to implement a ‘request_fn’ callback that the kernel calls while executing in an interrupt context [7,10]. Linux provides another callback point called ‘make_request’, which is intended to be used by pseudo block devices, such as a ramdisk. The latter callback differs from the former in that the latter is positioned at the highest point in Linux’s block I/O subsystem and is called within the context of the process thread.
3 Synchronous I/O completion model
When we say a process completes an I/O synchronously, we mean the kernel’s entire path handling an I/O request stays within the process context that initiated the I/O. A necessary condition for this synchronous I/O completion is that the CPU poll the device for completion. This polling must be realized by a spin loop, busy-waiting on the CPU while waiting for the completion.
Compared to the traditional asynchronous model, synchronous completion can reduce the CPU clock cycles needed for the kernel to process an I/O request. This reduction comes primarily from a shortened kernel path and from the removal of interrupt handling, but synchronous completion brings with it extra clock cycles spent in polling. In this section, we make the case for synchronous completion by quantifying these overheads. We then discuss problems with the asynchronous model and argue the correctness of the synchronous model.
3.1 Prototype hardware and device driver
For our measurements, we used a DRAM-based prototype block storage device connected to the system with an early prototype of an NVM Express* [5] interface to serve as a model of a fast future SSD based on next-generation NVM. The device was directly attached to a PCIe* Gen2 bus with eight lanes, with a device-based DMA engine handling data transfers. As described by the NVM Express specification, the device communicates with the device driver via segments of main memory, through which the device receives commands and places completions. The device can instantiate multiple device queues and can be configured to generate hardware interrupts upon command completion.
I/O completion method                  512B xfer   4KiB xfer
Interrupt, Gen2 bus, enters C-state    3.3 μs      4.6 μs
Interrupt, Gen2 bus                    2.6 μs      4.1 μs
Polling, Gen2 bus                      1.5 μs      2.9 μs
Interrupt, 8Gbps bus projection        2.0 μs      2.6 μs
Polling, 8Gbps bus projection          0.9 μs      1.5 μs

Table 1. Time to finish an I/O command, excluding software time, measured for our prototype device. The numbers measure random-read performance with a device queue depth of 1.
Table 1 shows performance statistics for the prototype device. The ‘C-state’ row refers to the latency when the CPU enters power-saving mode while the I/O is outstanding. The performance measured is limited by prototype throughput, not by anything fundamental; future SSDs may well feature higher throughputs. The improved performance projection assumes a higher throughput SSD on a saturated PCIe Gen3 bus (8Gbps).

Figure 1. Storage stack block I/O subsystem cost comparison. Each bar measures application-observed I/O completion latency, which is broken into device hardware latency and non-overlapping operating system latency. Error bars represent +/- one standard deviation.
We wrote a Linux device driver for the prototype hardware supporting both the asynchronous and synchronous completion models. For the asynchronous model the driver implements Linux’s ‘request_fn’ callback, thus taking the traditional path of using the stock kernel I/O queue. In this model, the driver uses a hardware interrupt and executes within the interrupt context for both I/O request submission and completion. For the synchronous model, the driver implements Linux’s ‘make_request’ callback, bypassing most of Linux’s block I/O infrastructure. In this model the driver polls for completion from the device and hence executes within the context of the thread that issued the I/O.
For this study, we assume that the hardware never triggers internal events that incur substantially longer latency than average. We expect that such events are rare and can easily be dealt with by having the operating system fall back to the traditional asynchronous model.
3.2 Experimental setup and methodology
We used 64-bit Fedora* 13 running a 2.6.33 kernel on an x86 dual-socket server with 12GiB of main memory. Each processor socket was populated with a quad-core 2.93GHz Intel® Xeon® with 8MiB of shared L3 cache and 256KiB of per-core L2 cache. Intel® Hyper-Threading Technology was enabled, totaling 16 architectural CPUs available to software. CPU frequency scaling was disabled.
For measurements we used a combination of the CPU timestamp counter and reports from user-level programs. Upon events of interest in the kernel, the device driver executed the ‘rdtsc’ instruction to read the CPU timestamp counter, whose values were later processed offline to produce kernel path latencies. For application IOPS (I/O Operations Per Second) and I/O system call completion latency, we used the numbers reported by the ‘fio’ [1] I/O micro-benchmark running in user mode.
We bypassed the file system and the buffer cache to isolate the cost of the block I/O subsystem. Note that our objective is to measure the difference between the two completion models when exercising the back-end block I/O subsystem, whose performance is not changed by the use of the file system or the buffer cache and would thus be additive to either completion model. The kernel was compiled with -O3 optimization and kernel preemption was enabled. The I/O scheduler was disabled for the asynchronous path by selecting the ‘noop’ scheduler in order to make the asynchronous path as fast as possible.
3.3 Storage stack latency comparison
Our measurements answer the following questions:
• How fast does each completion path complete application I/O requests?
• How much CPU time is spent by the kernel in each completion model?
• How much CPU time is available to another user process scheduled in during an asynchronous I/O?
Figure 1 shows that the synchronous model completes an I/O faster than the asynchronous path in terms of absolute latency. The figure shows actual measured latency for the user application performing 4KiB and 512B random reads. For our fast prototype storage device the CPU spin-wait cost in the synchronous path is lower than the code-path reduction achieved by the synchronous path, completing a 4KiB I/O synchronously in 4.4μs versus 7.6μs for the asynchronous case. The figure breaks the latency into hardware time and non-overlapping kernel time. The hardware time for the asynchronous path is slightly greater than that of the synchronous path due to interrupt delivery latency.
Figure 2 details the latency component breakdown of the asynchronous kernel path. In the figure, Tu indicates the CPU time actually available to another user process during the time slot vacated during asynchronous path I/O completion. To measure this time as accurately as possible, we implemented a separate user-level program scheduled to run on the same CPU as the I/O benchmark. This program continuously checked CPU timestamps to detect its scheduled period at a sub-microsecond granularity. Using this program, we measured Tu to be 2.7μs for a 4KiB transfer that the device takes 4.1μs to finish.
The conclusion of the stack latency measurements is a strong one: the synchronous path completes I/Os faster and uses the CPU more efficiently. This is true despite spin-waiting for the duration of the I/O, because the work the CPU performs in the asynchronous path (i.e., Ta + Tb = 6.3μs) is greater than the spin-waiting time of the synchronous path (4.38μs) with this fast prototype SSD. For smaller-sized transfers, synchronous completion by polling wins over asynchronous completion by an even greater margin.

Figure 2. Latency component breakdown of the asynchronous kernel path. Ta (= Ta’ + Ta”) indicates the cost of the kernel path that does not overlap with Td, which is the interval during which the device is active. Scheduling a user process P2 during the I/O interval incurs kernel scheduling cost, which is Tb. The CPU time available for P2 to make progress is Tu. For a 4KiB transfer, Ta, Td, Tb, and Tu measure 4.9, 4.1, 1.4 and 2.7μs, respectively.

Figure 3. Scaling of storage I/Os per second (IOPS) with an increased number of CPUs. For asynchronous IOPS, I/O threads are added until the utilization of each CPU reaches 100%.
With the synchronous completion model, improvement in hardware latency directly translates to improvement in software stack overhead. However, the same does not hold for the asynchronous model. For instance, using projected PCIe Gen3 bus performance, the spin-wait time is expected to be reduced from the current 2.9μs to 1.5μs, making the synchronous path time 3.0μs, while the asynchronous path overhead remains the same at 6.3μs. Of course the converse is also true: slow SSDs will be felt by the synchronous model, but not by the asynchronous model – clearly these results are most relevant for very low latency NVM.
This measurement study also sets a lower bound on the SSD latency for which the asynchronous completion model recovers absolutely no useful time for other processes: 1.4μs (Tb in Figure 2).
3.4 Further issues with interrupt-driven I/O
The increased stack efficiency gained with the synchronous model for low-latency storage devices does not just result in lower latency, but also in higher IOPS. Figure 3 shows the IOPS scaling for an increasing number of CPUs performing 512B randomly addressed reads. For this test, both the synchronous and asynchronous models use 100% of each included CPU. The synchronous model does so with just a single thread per CPU, while the asynchronous model required up to 8 threads per CPU to achieve maximum IOPS. In the asynchronous model, the total number of threads needed increases with the number of processors to compensate for the larger per-I/O latency.
The synchronous model shows the best per-CPU I/O performance, scaling linearly with the increased number of CPUs up to 2 million IOPS – the hardware limitation of our prototype device. Even with its larger number of threads per CPU, the asynchronous model displays a significantly lower I/O rate, achieving only 60-70% of the synchronous model. This lower I/O rate is a result of inefficiencies inherent in the use of the asynchronous model when accessing such a low-latency storage device. We discuss these inefficiencies in the following sections. It should be noted that this discussion is correct only for a very low latency storage device, like the one used here: traditional higher-latency storage devices gain compelling efficiencies from the use of the asynchronous model.
Interrupt overhead
The asynchronous model necessarily includes the generation and service of an interrupt. This interrupt brings with it extra, otherwise unnecessary work, increasing CPU utilization and therefore decreasing the I/O rate on a fully loaded system. Another problem is that the kernel processes hardware interrupts at high priority. Our prototype device can deliver hundreds of thousands of interrupts per second. Even if the asynchronous model driver completes multiple outstanding I/Os during a single hardware interrupt invocation, the device generates interrupts fast enough to saturate the system and cause user-noticeable delays. Further, while coalescing interrupts reduces CPU utilization overhead, it also increases completion latencies for individual I/Os.
Cache and TLB pollution
The short I/O-wait period in the asynchronous model can cause a degenerative task schedule, polluting hardware caches and TLBs. This is because the default task scheduler eagerly finds any runnable thread to fill the slot vacated by an I/O. With our prototype, the available time for a scheduled-in thread is only 2.7μs, which equals 8000 CPU clock cycles. If the scheduled thread is lower priority than the original thread, the original thread will likely be re-scheduled upon the completion of the I/O – lots of state swapping for little work done. Worse, thread data held in hardware resources such as memory caches and TLBs are replaced, only to be re-populated again when the original thread is scheduled back.

CPU power-state complications
Power management used in conjunction with the asynchronous model for the short I/O-wait of our device may not only reduce the power saving, but also increase I/O completion latency. A modern processor may enter a power-saving ‘C-state’ when not loaded or lightly loaded. Transitions among C-states incur latency. In the asynchronous model, the CPU enters a power-saving C-state when the scheduler fails to find a thread to run after sending an I/O command. The synchronous model does not automatically allow this transition to a lower C-state since the processor stays busy.
We have measured a latency impact from C-state transitions. When the processor enters a C-state during the I/O, the asynchronous path's completion latency increases, as shown in Table 1 (e.g., from 4.1μs to 4.6μs for a 4KiB transfer on the Gen2 bus).
