Paul E. McKenney, Linux Technology Center, IBM Beaverton (http://www.rdrop.com/users/paulmck)
Dipankar Sarma, Linux Technology Center, IBM India Software Lab
Orran Krieger, IBM T. J. Watson Research Center (http://www.eecg.toronto.edu/~okrieg)
Read-copy update is a mechanism for constructing highly scalable algorithms for accessing and modifying read-mostly data structures, while avoiding cacheline bouncing, memory contention, and deadlocks that plague highly scalable operating system implementations. In particular, code that performs read-only accesses may be written without any locks, atomic instructions, or writes to shared cachelines, even in the face of concurrent updates. We reported on the basic mechanism last year, and have produced a number of Linux™ patches implementing and exploiting read-copy update.

This paper evaluates the performance of a number of read-copy update implementations for non-preemptive Linux kernels, and outlines a new implementation targeted to preemptive Linux kernels.
1 Introduction
The past year has seen much discussion of read-copy update and the design and coding of a number of read-copy-update implementations. These implementations make a number of different tradeoffs, and this paper takes a first step towards evaluating them.
Comparison of read-copy update to other concurrent update mechanisms has been done elsewhere [McK01b, Linder02a]. These comparisons have shown that read-copy update can greatly simplify and improve the performance of code accessing read-mostly linked-list data structures (such as FD management tables and dcache data structures). Evaluation of read-copy update in other environments has shown that read-copy update can also improve the performance of code modifying linked-list data structures when there is a high system-wide aggregate update rate across all such data structures [McK98a].

The views expressed in this paper are the authors’ only, and should not be attributed to SuSE or IBM.

Read-Copy Update
Rusty Russell, Linux Technology Center, IBM Canberra
Ottawa Linux Symposium 2002
Section 2 fills in some background on read-copy update. Section 3 gives an overview of the design choices of the Linux read-copy update non-preemptive implementations. Section 4 compares performance and complexity of these implementations, with emphasis on the grace-period latency that determines the incremental memory overhead compared to non-read-copy-update locking algorithms. Section 5 overviews the implementations, focusing on call_rcu(), scheduler instrumentation, and timer processing. Section 5 also describes how the RCU algorithm may be adapted to a preemptible kernel. Section 6 describes future plans, Appendix A provides implementation details, and Appendix B discusses memory-ordering issues encountered when inserting into a read-copy-protected data structure.
2 Background
This section gives a brief overview of read-copy update; more details are available elsewhere [McK98a, McK01a, McK01b]. Section 2.1 contains a glossary of read-copy-update-related terms, Section 2.2 presents concepts, Section 2.3 presents the read-copy-update API, Section 2.4 describes the IP route cache patch that uses read-copy update, Section 2.5 describes the module race reduction patch that uses read-copy update, and Section 2.6 gives an overview of how read-copy update may be used in a preemptive kernel.
2.1 Glossary
Live Variable: A variable that might be accessed before it is modified, so that its current value has some possibility of influencing future execution state.

Dead Variable: A variable that will be modified before it is next accessed, so that its current value cannot possibly have any influence over future execution state.

Temporary Variable: A variable that is live only inside a critical section. One example is an auto variable used as a pointer while traversing a linked list.

Permanent Variable: A variable that is live outside of critical sections. One example would be the header for a linked list.1

Quiescent State: A point in the code where all of the current entity’s temporary variables that were in use before a specified time are dead. In a non-preemptive Linux kernel, a context switch is a quiescent state for CPUs. In a preemptive Linux kernel, a voluntary context switch is a quiescent state, but for threads rather than CPUs. In this paper, quiescent states are global events, as opposed to being associated with a specific data structure.
Grace Period: Time interval during which all entities (CPUs or tasks, as appropriate) pass through at least one quiescent state. Note that any time interval containing a grace period is itself a grace period.
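The distinction between temporary and permanent variables can be made concrete with a short user-space sketch (all names here are hypothetical illustrations, and the code runs single-threaded; in a kernel, the permanent list header would be the thing protected by read-copy update):

```c
#include <stddef.h>
#include <stdlib.h>

struct el {
    struct el *next;
    int key;
};

/* "head" is a permanent variable: it is live outside any
 * critical section. */
static struct el *head;

static struct el *search(int key)
{
    struct el *p;            /* temporary variable: live only
                              * while the traversal runs */

    for (p = head; p != NULL; p = p->next)
        if (p->key == key)
            return p;
    return NULL;             /* p is dead after the search */
}

static void push(int key)
{
    struct el *p = malloc(sizeof(*p));

    p->key = key;
    p->next = head;
    head = p;                /* update the permanent variable */
}
```

Once search() returns, its temporary pointer is dead, which is exactly why a context switch (after which no such temporaries survive) can serve as a quiescent state.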
The key point underlying read-copy update is that if you remove all permanent-variable references to a given item, then wait for a grace period to expire, there can be no remaining references to that item. The item can then be safely freed up. This process is described in more detail in the next section.

1 Yes, it is possible for the same variable to be temporary sometimes and permanent at other times. However, this can lead to confusion, so is not generally recommended.

2.2 Concepts

Read-copy update allows lock-free read-only access to data structures that are being concurrently modified. The accessing code needs neither locks nor atomic instructions, and can often be written as if the data structure were unchanging, in a “CS 101” style. Read-copy update is typically applied to linked data structures where the read-side code traverses links through the data structure in a single direction.

Without special action on the update side, the read side would be prone to races with deletions, as illustrated in Figure 1, which shows two tasks searching a list that contains an element that is concurrently deleted by a third task (signified by the line labelled “Route Cache Element”). To handle such race conditions, the update side uses a two-phase update discipline:

1. Remove permanent-variable pointers to the item being deleted.

2. After a grace period has elapsed, free up the item’s memory.

Figure 1: Race Between Deletion and Search

The grace period is not a fixed time duration, but is instead inferred by checking for per-CPU quiescent states, such as context switches. Since kernel threads are prohibited from holding locks across a context switch, they are also prohibited from holding pointers to data structures protected by those locks across context switches; after all, the entire data structure could well be deleted by some other CPU at any time the lock is not held.

Therefore, a simple implementation of read-copy update might declare the grace period over once it observed each CPU performing a context switch. Now, the first phase removed all global pointers to the item being deleted, and kernel threads are not permitted to hold references to the item across a context switch. Therefore, CPUs that have performed a context switch after the completion of the first phase have no way to gain a reference to the item being deleted. Thus, once all CPUs have performed a context switch, it is safe to free up the item being deleted from the list.

With this approach, searches already in progress when the first phase executes might (or might not) see the item being deleted. However, searches that start after the first phase completes are guaranteed never to reference this item. Therefore, the item may be safely freed once all searches in progress at the end of the first phase have completed, as shown in Figure 2.

Figure 2: Read-Copy Update Handling Race

    void synchronize_kernel(void);

    struct rcu_head {
        struct list_head list;
        void (*func)(void *obj);
        void *arg;
    };

    void call_rcu(struct rcu_head *head,
                  void (*func)(void *arg),
                  void *arg);

Figure 3: Read-Copy Update API
Efficient mechanisms for determining the du- ration of the grace period are key to read-copy update.
2.3 Read-Copy Update API
Figure 3 shows the external API for read-copy update. The synchronize_kernel() function blocks for a full grace period. This is a simple, easy-to-use function, but it imposes expensive context-switch overhead on its caller. It may not be called with locks held or from BH/IRQ context.
Another approach, taken by call_rcu(), is to schedule a function to be called after the end of a full grace period. Since call_rcu() never sleeps, it may be called with locks held or from BH (and perhaps also IRQ) context. The call_rcu() function uses its struct rcu_head argument to store the specified callback function and argument, and the read-copy-update subsystem then uses this struct to schedule the callback invocation. An rcu_head is often placed within a structure being protected by read-copy update.
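The deferral behind call_rcu() can be modeled in user space. The following sketch is purely illustrative (run_grace_period() and fake_free() are hypothetical stand-ins; a real kernel detects grace periods through quiescent states rather than an explicit call): callbacks are queued immediately, but invoked only when the simulated grace period ends.

```c
#include <stddef.h>

struct rcu_head {
    struct rcu_head *next;
    void (*func)(void *arg);
    void *arg;
};

/* Callbacks awaiting the end of a grace period. */
static struct rcu_head *rcu_pending;

/* Queue a callback; never sleeps, so in a kernel it could be
 * called with locks held. */
static void call_rcu(struct rcu_head *head,
                     void (*func)(void *arg), void *arg)
{
    head->func = func;
    head->arg = arg;
    head->next = rcu_pending;
    rcu_pending = head;
}

/* Simulate the end of a grace period: invoke every queued
 * callback exactly once. */
static void run_grace_period(void)
{
    struct rcu_head *p, *next;

    for (p = rcu_pending, rcu_pending = NULL; p != NULL; p = next) {
        next = p->next;
        p->func(p->arg);
    }
}

/* Demo callback standing in for a kfree() wrapper. */
static int freed_count;
static void fake_free(void *arg)
{
    (void)arg;
    freed_count++;
}
```

The essential property is visible in the ordering: nothing queued via call_rcu() runs until the grace period is declared over.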
A typical use of call_rcu() is shown in Figure 4, where an element is deleted from a circular doubly linked list with a header element. Here my_free() is a wrapper around kfree(), and the lock is used only to serialize concurrent calls to delete(). Since the element’s next and prev pointers are unaffected, and since my_free() is not called until a grace period has elapsed, non-sleeping reading tasks may traverse the list concurrently with the deletion of the element without danger of a NULL pointer or a pointer to the freelist. This is a common read-copy-update idiom: kfree() is replaced by a call_rcu() to a function that is a wrapper around kfree().

    void delete(struct el *p)
    {
        spin_lock(&list_lock);
        p->next->prev = p->prev;
        p->prev->next = p->next;
        spin_unlock(&list_lock);
        call_rcu(&p->my_rcu_head,
                 my_free, p);
    }

Figure 4: Read-Copy Dequeue From Doubly-Linked List
2.4 Read-Copy Update and IP Route Cache
Read-copy update has been used in a number of OSes, including several patches to Linux [McK01b, Linder02a]. This section describes how read-copy update may be used in the Linux IP route cache. This modification was done to validate the RCU implementations, rather than in response to a known performance problem in the IP route cache.
The Linux IP route cache uses a reader-writer lock, so multiple searches may proceed in parallel. However, the multiple readers’ lock acquisitions result in cacheline bouncing. Read-copy update may be used to eliminate this read-side cacheline bouncing:
Ottawa Linux Symposium 2002
    @@ -314,13 +314,13 @@
     static inline void rt_free(
         struct rtable *rt)
     {
    -    dst_free(&rt->u.dst);
    +    call_rcu(&rt->u.dst.rcu_head,
    +             (void (*)(void *))dst_free,
    +             &rt->u.dst);
     }

     static inline void rt_drop(
         struct rtable *rt)
     {
         ip_rt_put(rt);
    -    dst_free(&rt->u.dst);
    +    call_rcu(&rt->u.dst.rcu_head,
    +             (void (*)(void *))dst_free,
    +             &rt->u.dst);
     }

Figure 5: dst_free() Modifications
1. Delete all calls to read_lock(), read_unlock(), read_lock_bh(), and read_unlock_bh().
2. Replace all calls to write_lock(), write_unlock(), write_lock_bh(), and write_unlock_bh() with the corresponding member of the spin_lock() family of primitives.
3. Add rmb() primitives on the read side between the fetch of the pointer and its dereferencing. These should be replaced by read_barrier_depends() when it becomes available.

4. Replace all calls to dst_free() with a call to call_rcu(), which causes dst_free() to be invoked after the end of a following grace period, as shown in Figure 5.
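Step 3's pairing of a write-side barrier with a read-side rmb() corresponds, in user-space C11, to release/acquire ordering on the published pointer. The sketch below is an assumed analogue, not the kernel patch itself (struct route, publish(), and lookup() are hypothetical names): the element is fully initialized before the release store publishes it, and the acquire load stands in for the rmb() between fetching the pointer and dereferencing it.

```c
#include <stdatomic.h>
#include <stddef.h>

struct route {
    int dst;
    int metric;
};

/* The shared pointer a reader follows without holding a lock. */
static _Atomic(struct route *) cache_head;

/* Update side: initialize the element fully, then publish it.
 * The release store plays the role of the write-side barrier. */
static void publish(struct route *r, int dst, int metric)
{
    r->dst = dst;
    r->metric = metric;
    atomic_store_explicit(&cache_head, r, memory_order_release);
}

/* Read side: the acquire load plays the role of rmb() (more
 * precisely, of read_barrier_depends()) between fetching the
 * pointer and dereferencing it. */
static struct route *lookup(void)
{
    return atomic_load_explicit(&cache_head, memory_order_acquire);
}
```

With this pairing, a reader that sees the new pointer is guaranteed to see the fields stored before publication.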
Figure 6: IP Route Cache Speedup Using rcu

This results in a significant decrease in ip_route_output_key() overhead during a workload that transmits a fixed number of random-sized IP packets to a single destination, as shown in Figure 6. This workload was run on an 8-CPU 700MHz Pentium™ III Xeon™ with 1MB L2 cache and 6GB of memory.
Figure 7 shows the total non-idle kernel profile ticks for this same workload. This data shows that the IP route cache speedup is real; it is not happening at the expense of other processing in the system. The overall speedup is quite small, as expected, given that the change was not motivated by a known performance problem.2 More compelling Linux-based read-copy-update results include a 30% improvement for FD management [McK01b] and a 25% improvement for dcache management [Blanchard02a, Linder02a].
2.5 Read-Copy Update and Module Race Reduction
Linux 2.4 is subject to races between module unloading and use of that module. These races can result in the code that is attempting to use the module holding a reference to newly freed memory, most likely resulting in an “oops.”

2 However, we will be measuring this patch on various workloads as Linux’s scaling continues to improve.

Figure 7: IP Route Cache System Performance Using rcu
One way to reduce the likelihood of these races occurring is to wait for a grace period after removing the module structure from the module_list before kfree()ing it in free_module() [Kleen02a]. Races can still occur, but the race’s window has been decreased substantially. The change is a one-liner (not counting comments), as shown in Figure 8.
As noted earlier, this change does not address all the module-unloading problems. However, we hope that it can be a basis for a full solution. This approach is now being used in production in SuSE Linux.
2.6 Read-Copy Update and Preemption
Preemption has recently been added to Linux in 2.5.4. The addition of preemption means that read-side kernel code is subject to involuntary context switches. If not taken into account, this leads to premature flagging of the ends of grace periods. There are two ways to handle preemption: (1) explicitly disabling preemption over read-side code segments, and (2) considering only voluntary context switches to be quiescent states.

    @@ -1065,6 +1066,12 @@
         p->next = mod->next;

         spin_unlock_irqrestore(&modlist_lock,
                                flags);

    +    /* Wait for all other cpus to go
    +     * through a context switch. This
    +     * doesn't plug all module unload
    +     * races, but at least some of
    +     * them and makes the window much
    +     * smaller.
    +     */
    +    synchronize_kernel();

         /* And free the memory. */

Figure 8: Module Unloading
Explicitly disabling preemption over read-side code segments adds unwanted overhead to reading processes, and removes some of the latency benefits provided by preemption. In contrast, considering only voluntary context switches to be quiescent states allows the kernel to reap the full benefit of reduced latency. This scheme for tracking only voluntary context switches is inspired by the K42 implementation [Gamsa99].3 The main drawback is increased length of grace periods. This paper focuses on the voluntary-context-switch option and its effects.
3 K42’s extensive use of blocking locks and short-lived threads results in use of thread termination rather than voluntary context switch as the K42 quiescent state. In addition, Linux migrates preempted tasks to other CPUs, which requires special tracking of tasks that have been preempted since their last voluntary context switch.
3 Read-Copy Update Implementations
As noted earlier, the key to read-copy update is a CPU-efficient mechanism for determining the required duration of the grace period. This mechanism is permitted to overestimate the grace-period duration, but the greater the overestimation, the greater the amount of memory that will be consumed by waiting callbacks. There are a number of simple and efficient algorithms to determine grace-period duration, and this paper reviews a number of them.
There are a number of design parameters for a read-copy update implementation:
1. Batching. Many implementations batch requests, so that a single grace-period identification can satisfy multiple requests. Batching is particularly important for implementations with heavyweight grace-period identification mechanisms. Although there have been implementations without batching [McK01a], all implementations described in this paper do batching.

2. Deducing the length of the grace period. The simplest mechanisms force a grace period by a reschedule on all CPUs in non-preemptive kernels. However, this approach is relatively expensive, particularly if extended to cope with preemptible kernels. More efficient implementations use something like per-CPU quiescent-state counters to deduce when the natural course of events has resulted in the expiration of a grace period.
3. Polling mechanism. Implementations that deduce when a grace period has ended must use some mechanism to be informed of this event:
(a) Adding explicit checks to code corresponding to quiescent states, for example, rcu-sched’s hooks in the Linux scheduler shown in Figure 29. Explicit checks allow fast response to quiescent states, but add overhead when there are no read-copy callbacks in flight.

(b) Adding counters to code corresponding to quiescent states, and using kernel daemons to check the counters, as shown in Figure 13. This approach adds some complexity, but greatly reduces the overhead when there are no read-copy callbacks in flight.

(c) As above, but using tasklets instead of kernel daemons to do the checking. This further reduces the overhead, but uses more exotic features of Linux.

(d) As above, but using a per-CPU timer handler [Sarma02a] instead of tasklets to do the checking. It is not yet clear which of tasklets and timer handlers is preferable.
If the implementation forces the end of the grace period, it must similarly use a mechanism for doing so:

(a) Scheduling a thread on each CPU in turn. This has the advantage of immediacy, but cannot be used from BH or IRQ, and gains no performance benefit from batching.

(b) Reserving a kernel daemon that, upon request, schedules itself on each CPU in turn. This permits batching and use from BH and IRQ, but is more complex.
4. Request queuing. Requests may be queued globally or on a per-CPU basis. Grace periods must of course always be detected globally, but per-CPU queuing can reduce the CPU overhead incurred by call_rcu(). This is a classic performance/complexity tradeoff. The correct choice depends on the workload.
5. Quiescent state definition. For non-preemptive kernels, context switch is a popular choice. For preemptive Linux kernels (such as Linux 2.5), voluntary context switch may instead be used.
6. Environments. If call_rcu() use is prohibited in the BH or IRQ contexts, then more kernel functionality is available to the implementor of call_rcu(), and less overhead is incurred.
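Design parameter 2's counter-based deduction can be sketched with a toy user-space model (hypothetical names throughout; real implementations batch callbacks and poll from a daemon, tasklet, or timer handler rather than calling these functions directly): snapshot each CPU's quiescent-state counter when the grace period starts, and declare it over once every counter has advanced past its snapshot.

```c
#include <stdbool.h>

#define NR_CPUS 4

/* Incremented by each CPU at every quiescent state
 * (e.g., at each context switch). */
static unsigned long qs_count[NR_CPUS];

/* Per-CPU snapshot taken when a grace period begins. */
static unsigned long qs_snap[NR_CPUS];

static void start_grace_period(void)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        qs_snap[cpu] = qs_count[cpu];
}

/* The grace period has ended once every CPU has passed through
 * at least one quiescent state since the snapshot. */
static bool grace_period_done(void)
{
    int cpu;

    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (qs_count[cpu] == qs_snap[cpu])
            return false;
    return true;
}

/* Called from the scheduler (or a similar hook) on this CPU. */
static void note_quiescent_state(int cpu)
{
    qs_count[cpu]++;
}
```

Note that this deduction can only overestimate the grace period, which is safe: waiting longer than necessary merely delays callback invocation.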
Section 5 describes a number of Linux implementations of read-copy update, summarized in Table 1.
All the implementations in Table 1 except rcu-preempt assume a run-to-block kernel. Section 5.7 describes rcu-preempt, which operates efficiently in a preemptive kernel.
The “QS” column lists the quiescent states that each algorithm tracks: “I” for idle-loop execution, “C” for context switch, and “U” for user-mode execution.
The “BH/IRQ Safe” column indicates whether code running in BH/IRQ context may safely delete elements of a read-copy-protected data structure.