
Translation Caching: Skip, Don’t Walk (the Page Table)

Thomas W. Barr, Alan L. Cox, Scott Rixner
Rice University, Houston, TX
{twb, alc, rixner}@rice.edu

ABSTRACT
This paper explores the design space of MMU caches that accelerate virtual-to-physical address translation in processor architectures, such as x86-64, that use a radix tree page table. In particular, these caches accelerate the page table walk that occurs after a miss in the Translation Lookaside Buffer. This paper shows that the most effective MMU caches are translation caches, which store partial translations and allow the page walk hardware to skip one or more levels of the page table.
In recent years, both AMD and Intel processors have implemented MMU caches. However, their implementations are quite different and represent distinct points in the design space. This paper introduces three new MMU cache structures that round out the design space and directly compares the effectiveness of all five organizations. This comparison shows that two of the newly introduced structures, both of which are translation cache variants, are better than existing structures in many situations.
Finally, this paper contributes to the age-old discourse concerning the relative effectiveness of different page table organizations. Generally speaking, earlier studies concluded that organizations based on hashing, such as the inverted page table, outperformed organizations based upon radix trees for supporting large virtual address spaces. However, these studies did not take into account the possibility of caching page table entries from the higher levels of the radix tree. This paper shows that any of the five MMU cache structures will reduce radix tree page table DRAM accesses far below an inverted page table.


Categories and Subject Descriptors
C.0 [General]: Modeling of computer architecture; C.4 [Performance of Systems]: Design studies; D.4.2 [Operating Systems]: Virtual Memory
General Terms
Performance, Design
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISCA’10, June 19–23, 2010, Saint-Malo, France. Copyright 2010 ACM 978-1-4503-0053-7/10/06 …$10.00.
Keywords
TLB, Memory Management, Page Walk Caching
1. INTRODUCTION
This paper explores the design space of memory-management unit (MMU) caches for accelerating virtual-to-physical address translation in processor architectures, like x86-64, that implement paged virtual memory using a radix tree for their page table. In particular, these caches accelerate the page table walk that occurs after a miss in the Translation Lookaside Buffer (TLB). In fact, a hit in some of these caches enables the processor to skip over one or more levels of the tree and, in the best case, access only the tree’s lowest level.
For several generations of x86 processors, from the Intel 80386 to the Pentium, the page table had at most two levels. Consequently, whenever a TLB miss occurred, at most two memory accesses were needed to complete the translation. However, as the physical and virtual address spaces supported by x86 processors have grown in size, the maximum depth of the tree has increased, first to three levels in the Pentium Pro to accommodate a 36-bit physical address within a page table entry, and more recently to four levels in the AMD Opteron to support a 48-bit virtual address space. In fact, with each passing decade since the introduction of the 80386, the depth of the tree has grown by one level.
Recent work has shown that the impact of TLB misses on overall system performance ranges from 5-14% for nominally sized applications, even in a non-virtualized environment [9]. As the application's memory footprint increases, TLB misses have a significantly larger impact on performance, approaching 50% in some cases [19]. Although the use of large pages can lessen this impact, their effectiveness declines with further increases in the memory footprint. Therefore, both AMD and Intel have implemented MMU caches for page table entries from the higher levels of the tree [3, 9]. However, their caches have quite different structures. For example, AMD's Page Walk Cache stores page table entries from any level of the tree, whereas Intel implements distinct caches for each level of the tree. Also, AMD's Page Walk Cache is indexed by the physical address of the cached page table entry, whereas Intel's Paging-Structure Caches are indexed by portions of the virtual address being translated. Thus, in this respect, the Page Walk Cache resembles the processor's data cache, whereas the Paging-Structure Caches resemble its TLB.
This paper's primary contribution is that it provides the first comprehensive exploration of the design space occupied by these caches. In total, it discusses five distinct points in this space, including three new designs. Specifically, it presents the first head-to-head comparison of the effectiveness of these designs. In general, the results of this comparison show that the translation caches, which store partial translations and allow the page walk hardware to skip one or more levels of the page table, are the best. In addition, the new translation caches that are introduced by this paper are better than the existing caches in many situations and workloads.
Finally, this paper contributes to the age-old discourse concerning the relative effectiveness of different page table organizations. Generally speaking, earlier studies concluded that organizations based on hashing, such as the inverted page table, outperformed organizations based upon radix trees for supporting large virtual address spaces [23, 15]. However, these studies did not take into account the possibility of caching page table entries from the higher levels of the radix tree. This paper shows that radix tables cause up to 20% fewer total memory accesses and up to 400% fewer DRAM accesses than hash-based tables because of the locality in virtual address use.
This paper is organized as follows. The next section provides the background for the rest of the paper. Specifically, it summarizes the relevant aspects of x86/x86-64 virtual-to-physical address translation. Section 3 describes the design space, identifying the essential differences between AMD's Page Walk Cache, Intel's Paging-Structure Caches, and the new structures developed in this paper. Section 4 qualitatively compares these structures, and Section 5 describes this paper's methodology for quantitatively comparing them. Section 6 presents quantitative simulation results of their effectiveness as compared to one another, and Section 7 compares the effectiveness of a radix tree page table with these structures to competing page table designs. Additionally, Section 7 examines the interaction between the MMU caches and large pages. Section 8 discusses the related work. Finally, Section 9 summarizes this paper's conclusions.
2. X86 ADDRESS TRANSLATION
All x86 processors since the Intel 80386 have used a radix tree to record the mapping from virtual to physical addresses. Although the depth of this tree has increased, to accommodate larger physical and virtual address spaces, the procedure for translating virtual addresses to physical addresses using this tree is essentially unchanged. A virtual address is split into a page number and a page offset. The page number is further split into a sequence of indices. The first index is used to select an entry from the root of the tree, which may contain a pointer to a node at the next lower level of the tree. If the entry does contain a pointer, the next index is used to select an entry from this node, which may again contain a pointer to a node at the next lower level of the tree. These steps repeat until the selected entry is either invalid (in essence, a NULL pointer indicating there is no valid translation for that portion of the address space) or the entry instead points to a data page using its physical address. In the latter case, the page offset from the virtual address is added to the physical address of this data page to obtain the full physical address. In a simple memory management unit (MMU) design, this procedure requires one memory access per level in the tree.
  bits:  63:48   47:39    38:30    29:21    20:12    11:0
         se      L4 idx   L3 idx   L2 idx   L1 idx   page offset

Figure 1: Decomposition of the x86-64 virtual address.

Figure 2: An example page walk for virtual address (0b9, 00c, 0ae, 0c2, 016). Each page table entry stores the physical page number for either the next lower level page table page (for L4, L3, and L2) or the data page (for L1). Only 12 bits of the 40-bit physical page number are shown in these figures for simplicity.

Figure 1 shows the precise decomposition of a virtual address by x86-64 processors [1]. Standard x86-64 pages are 4KB, so there is a single 12-bit page offset. The remainder of the 48-bit virtual address is divided into four 9-bit indices, which are used to select entries from the four levels of the page table. The four levels of the x86-64 page table are named PML4 (Page Map Level 4), PDP (Page Directory Pointer), PD (Page Directory) and PT (Page Table). In this paper, however, for clarity, we will refer to these levels as L4 (PML4), L3 (PDP), L2 (PD) and L1 (PT). Finally, the 48-bit virtual address is sign extended to 64 bits. As the virtual address space grows, additional index fields (e.g., L5) may be added, reducing the size of the se field.
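For concreteness, this decomposition can be written as a few lines of C. The sketch below is ours, not taken from the paper; the function names pt_index() and page_offset() are illustrative only.

    #include <stdint.h>

    /* Extract the 9-bit page table index for a given level (1..4) of the
     * radix tree, and the 12-bit page offset, from a 48-bit x86-64 virtual
     * address.  Level 4 selects the L4 (PML4) index, level 1 the L1 (PT)
     * index. */
    static inline unsigned int pt_index(uint64_t va, int level)
    {
        return (unsigned int)((va >> (12 + 9 * (level - 1))) & 0x1FF);
    }

    static inline unsigned int page_offset(uint64_t va)
    {
        return (unsigned int)(va & 0xFFF);
    }

    /* Example from the text: 0x00005c8315cc2016 decomposes to
     * (L4, L3, L2, L1, offset) = (0x0b9, 0x00c, 0x0ae, 0x0c2, 0x016). */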
An entry in the page table is 8 bytes in size regardless of its level within the tree. Since a 9-bit index is used to select an entry at every level of the tree, the overall size of a node is always 4KB, the same as the page size. Hence, nodes are commonly called page table pages. The tree can be sparsely populated with nodes: if at any level, there are no valid virtual addresses with a particular 9-bit index, the sub-tree beneath that index is not instantiated. For example, if there are no valid virtual addresses with L4 index 0x03a, that entry in the top level of the page table will indicate so, and the 262,657 page table pages (1 L3 page, 512 L2 pages, and 262,144 L1 pages) beneath that entry in the radix tree page table will not exist. This yields significant memory savings, as large portions of the 256 TB virtual address space are never allocated for typical applications.
Figure 2 illustrates the radix tree page table walk for the virtual address 0x00005c8315cc2016. For the remainder of the paper, such 64-bit virtual addresses will be denoted as (L4 index, L3 index, L2 index, L1 index, page offset) for clarity. In this case, the virtual address being translated is (0b9, 00c, 0ae, 0c2, 016). Furthermore, for simplicity of the examples, only 3 hexadecimal digits (12 bits) will be used to indicate the physical page number, which is actually 40 bits in x86-64 processors.
As shown in the figure, the translation process for this address proceeds as follows. First, the page walk hardware must locate the top-level page table page, which stores L4 entries. The physical address of this page is stored in the processor's CR3 register. In order to translate the address, the L4 index field (9 bits) is extracted from the virtual address and appended to the physical page number (40 bits) from the CR3 register. This yields a 49-bit physical address that is used to address the appropriate 8-byte L4 entry (offset 0b9 in the L4 page table page in the figure). The L4 entry may contain the physical page number of an L3 page table page (in this case 042). The process is repeated by extracting the L3 index field from the virtual address and appending it to this physical page number to address the appropriate L3 entry. This process repeats until the selected entry is invalid or specifies the physical page number of the actual data in memory, as shown in the figure. Each page table entry along this path is highlighted in grey in the figure. The page offset from the virtual address is then appended to this physical page number to yield the data's physical address. Note that since page table pages are always aligned on page boundaries, the low order bits of the physical address of the page table pages are not stored in the entries of the page table.
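A simplified software model of this walk, under the same assumptions, might look as follows. This is our own sketch rather than a description of the hardware; read_pte() is a hypothetical stand-in for one memory access, and permission bits and error reporting are omitted.

    #include <stdbool.h>
    #include <stdint.h>

    #define PTE_PRESENT 0x1ULL                  /* x86 present/valid bit */

    /* Hypothetical helper: fetch the 8-byte page table entry stored at a
     * physical address (via the memory hierarchy). */
    extern uint64_t read_pte(uint64_t paddr);

    static inline uint64_t pt_index(uint64_t va, int level)
    {
        return (va >> (12 + 9 * (level - 1))) & 0x1FF;
    }

    /* Walk the four-level page table from the L4 page (whose physical page
     * number is held in CR3) down to the L1 entry.  Each iteration appends a
     * 9-bit index to a 40-bit physical page number to address one entry, as
     * described above. */
    bool walk(uint64_t cr3_ppn, uint64_t va, uint64_t *pa)
    {
        uint64_t ppn = cr3_ppn;                  /* current page table page */
        for (int level = 4; level >= 1; level--) {
            uint64_t entry_addr = (ppn << 12) | (pt_index(va, level) << 3);
            uint64_t pte = read_pte(entry_addr);          /* one access per level */
            if (!(pte & PTE_PRESENT))
                return false;                    /* no valid translation */
            ppn = (pte >> 12) & 0xFFFFFFFFFFULL; /* next table page, or data page at L1 */
        }
        *pa = (ppn << 12) | (va & 0xFFF);        /* append the page offset */
        return true;
    }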
Given this structure, the current 48-bit x86-64 virtual address space requires four memory references to “walk” the page table from top to bottom to translate a virtual address (one for each level of the radix tree page table). As the address space continues to grow, more levels will be added to the page table, further increasing the cost of address translation. A full 64-bit virtual address space will require six levels, leading to six memory accesses per translation.
Alternatively, an L2 entry can directly point to a contiguous and aligned 2MB data page instead of pointing to an L1 page table page. In Figure 2, virtual address (0b9, 00d, 0bd, 123f5d7) is within a large page. This large-page support greatly increases maximum TLB coverage. In addition, it lowers the number of memory accesses to locate one of these pages from four to three. Finally, it greatly reduces the number of total page table entries required since each entry maps a much larger region of memory.
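The 2MB case can be sketched in the same style. On x86-64, an L2 (PD) entry that maps a large page is marked with a page-size flag, and the low 21 bits of the virtual address become the offset within the page; the helper below is illustrative, not the paper's hardware.

    #include <stdbool.h>
    #include <stdint.h>

    #define PDE_PAGESIZE 0x80ULL   /* x86 PS flag (bit 7) in an L2 (PD) entry */

    /* If an L2 entry has the PS flag set, it maps a 2MB page directly: the
     * entry holds a 2MB-aligned frame number and the low 21 bits of the
     * virtual address are the offset, so the walk ends at level 2. */
    static inline bool translate_2mb(uint64_t l2_pte, uint64_t va, uint64_t *pa)
    {
        if (!(l2_pte & PDE_PAGESIZE))
            return false;
        uint64_t frame = (l2_pte >> 21) & ((1ULL << 31) - 1);
        *pa = (frame << 21) | (va & 0x1FFFFF);
        return true;
    }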
3. CACHING PAGE WALKS
While radix-tree page tables require many accesses to translate a single address, the accesses to the upper level page table entries have significant temporal locality. Walks for two consecutive pages in the virtual address space will usually use the same three upper level entries, since the indices selecting these entries come from high-order bits of the virtual address, which change less frequently.
While the MMU does access the page table through the memory hierarchy, it only has access to the L2 data cache in at least one major commercial x86 design [9]. Since the L2 data cache is relatively slow on modern CPUs, accessing three upper-level page table entries on every page walk will incur a penalty of several tens of cycles per TLB miss, even if all entries are present in the L2 data cache.
Therefore, the x86 processor vendors have developed private, low-latency caches for the MMU that store upper level page table entries [9, 3]. In this section, we describe the design space and provide a nomenclature for the different tagging and partitioning schemes used by these MMU caches.
MMU caches may store elements from the page table tagged by their physical address in memory, as a conventional data cache might. We call such MMU caches page table caches. Examples include AMD's Page Walk Cache and the L2 data cache, although the latter is not private to the MMU. Alternatively, MMU caches can be indexed by parts of the virtual address, like a TLB. We call such MMU caches translation caches. Intel's Paging-Structure Caches are translation caches.
For either of these tagging schemes, elements from different levels of the page table can be mixed in a single cache (a unified cache), or placed into separate caches (a split cache). Finally, each cache entry can store an entry from one level along the page walk, or it can store an entire path (a path cache).
3.1 Page table caches
The simplest example of a page table cache is the processor's L2 data cache. The page walker generates a physical address based upon the page table page to be accessed and an index from the virtual address. This physical address is then fetched from the processor's memory hierarchy starting with the L2 data cache.

  Base Location   Index   Next Page
  125             0ae     508
  042             00c     125
  613             0b9     042

Figure 3: An example of the contents of a UPTC. Each entry is tagged with the address of a page table entry, consisting of the 40-bit physical page number of the page table page and a 9-bit index into it. The entry then provides a 40-bit physical page number for the next lower level page table page. (Only 12 bits of the physical page numbers are shown, for simplicity.)

  L2 entries:  Base Location 125, Index 0ae, Next Page 508
  L3 entries:  Base Location 042, Index 00c, Next Page 125
  L4 entries:  Base Location 613, Index 0b9, Next Page 042

Figure 4: An example of the contents of a SPTC. Each entry holds the same tag and data as in the UPTC.
Page table caches use this same indexing scheme. Elements are tagged with their physical address in the page table. These tags are the size of the physical page number plus the size of one page table index. L1 entries are not cached in any of the designs presented here (since the TLB itself caches those entries).
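In rough C terms, one entry of such a page table cache could be modeled as follows; the struct and field names are our own shorthand for the tag and data described above, not a hardware specification.

    #include <stdbool.h>
    #include <stdint.h>

    /* One entry of a page table cache.  The tag is the physical location of
     * the cached page table entry (the page table page's physical page number
     * plus the 9-bit index into that page); the data is the physical page
     * number of the next lower level page table page. */
    struct pt_cache_entry {
        uint64_t base_ppn;   /* tag: 40-bit physical page number of the page table page */
        uint16_t index;      /* tag: 9-bit index within that page */
        uint64_t next_ppn;   /* data: 40-bit physical page number of the next level page */
        bool     valid;
    };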
3.1.1 Unified Page Table Cache (UPTC)
The simplest design for a dedicated page table cache is a single, high-speed, read-only cache for page table entries, tagged by their physical address in memory. Entries from different levels of the page table are mixed in the same cache, all indexed by their physical address. Such a cache is analogous to a private, read-only L1 data cache for page table entries. However, like a TLB, coherence between this cache and the page table can be maintained by software with little overhead. AMD's Page Walk Cache has this design [9].
Figure 3 shows an example of the contents of the Unified Page Table Cache (UPTC) after the MMU walks the page table to translate the virtual address (0b9, 00c, 0ae, 0c2, 016). If the MMU subsequently tries to translate the virtual address (0b9, 00c, 0ae, 0c3, 103), the page walk will begin by looking for the page table entry at index 0b9 in the L4 page table page (located at 613 and referenced by the CR3 register). Since this page table entry is present in the UPTC, it does not need to be loaded from the memory hierarchy.
This entry indicates that the L3 page table page has physical page number 042. The same process is then repeated to locate the L2 and L1 page table pages. Once the address of the L1 page table page is found, the appropriate entry is loaded from memory to determine the physical page address of the desired data.
Without a page table cache, all four of these accesses to page table entries would have required a memory access, each of which may or may not hit in the L2 data cache. In contrast, with the page table cache, the three top entries hit in the private page table cache, and only one entry (the L1 entry) requires a memory access, which may or may not hit in the L2 data cache.
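The effect of the UPTC on a walk can be sketched as follows. This is a software model under our assumptions: uptc_lookup() is a hypothetical helper, and validity checks, refill, and replacement policy are omitted.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers: uptc_lookup() returns true and the next level's
     * physical page number if the entry tagged (base_ppn, index) is cached;
     * read_pte() is a memory access that may or may not hit in the L2 data
     * cache. */
    extern bool uptc_lookup(uint64_t base_ppn, uint16_t index, uint64_t *next_ppn);
    extern uint64_t read_pte(uint64_t paddr);

    static inline uint16_t pt_index(uint64_t va, int level)
    {
        return (uint16_t)((va >> (12 + 9 * (level - 1))) & 0x1FF);
    }

    /* L4, L3 and L2 entries may be served by the UPTC; L1 entries are never
     * cached here, so the final access always goes to the memory hierarchy. */
    uint64_t walk_with_uptc(uint64_t cr3_ppn, uint64_t va)
    {
        uint64_t ppn = cr3_ppn;
        for (int level = 4; level >= 2; level--) {
            uint16_t idx = pt_index(va, level);
            uint64_t next;
            if (uptc_lookup(ppn, idx, &next)) {
                ppn = next;                                   /* hit: skip the memory access */
            } else {
                uint64_t pte = read_pte((ppn << 12) | ((uint64_t)idx << 3));
                ppn = (pte >> 12) & 0xFFFFFFFFFFULL;          /* miss: fetch from memory */
            }
        }
        uint64_t l1_pte = read_pte((ppn << 12) | ((uint64_t)pt_index(va, 1) << 3));
        return (((l1_pte >> 12) & 0xFFFFFFFFFFULL) << 12) | (va & 0xFFF);
    }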

Figure 5: An example of the contents of an STC. Entries from different levels of the page table are kept in separate caches (L2, L3, and L4 entry caches), each tagged by the leading virtual-address indices; for example, the L2 entry tagged (0b9, 00c, 0ae) provides 508, the physical page number of the corresponding L1 page table page.
These searches can be performed in any order, and even in parallel. In the above example, the cache can provide the location of the appropriate L3 and L2 page table pages, but not the L1 page table page, as (0b9, 00c, 0dd) is not present in the L2 entry cache. Ultimately, the MMU would use the (0b9, 00c) entry from the L3 entry cache because it allows the page walk to begin further down the tree, at the L2 page table page.
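That selection policy can be modeled as a longest-prefix search over the three split caches. The lookup helpers below are hypothetical names of ours, and a real MMU could probe all three caches in parallel rather than sequentially.

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical lookups in the three split translation caches.  Each cache
     * is tagged by leading virtual-address indices and, on a hit, returns the
     * physical page number of the page table page the walk can resume from. */
    extern bool l2e_lookup(uint16_t l4, uint16_t l3, uint16_t l2, uint64_t *l1_ppn);
    extern bool l3e_lookup(uint16_t l4, uint16_t l3, uint64_t *l2_ppn);
    extern bool l4e_lookup(uint16_t l4, uint64_t *l3_ppn);

    /* Return the level at which the page walk should start, together with the
     * physical page number of that page table page.  The longest matching
     * prefix wins because it lets the walk begin furthest down the tree. */
    int start_level(uint16_t l4, uint16_t l3, uint16_t l2,
                    uint64_t cr3_ppn, uint64_t *ppn)
    {
        if (l2e_lookup(l4, l3, l2, ppn)) return 1;  /* only the L1 entry remains */
        if (l3e_lookup(l4, l3, ppn))     return 2;  /* resume at the L2 page table page */
        if (l4e_lookup(l4, ppn))         return 3;  /* resume at the L3 page table page */
        *ppn = cr3_ppn;
        return 4;                                   /* full walk from the root */
    }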
3.2.2 Unified Translation Cache (UTC)
Just as the page table caches can be built with either a split or a unified organization, a Unified Translation Cache (UTC) can also be built. Moreover, just like the UPTC, the UTC mixes elements from all levels of the page table in the same cache.
Figure 6 shows the UTC after the MMU walks the page table to translate the virtual address (0b9, 00c, 0ae, 0c2, 016). If the MMU subsequently starts to translate the virtual address (0b9, 00c, 0dd, 0c3, 929), it will first look in the UTC for the physical page numbers of the L1, L2 and L3 page table pages. As with the previous example that used the STC, the MMU finds two matching entries in the UTC. Ultimately, the MMU decides to use the UTC’s second entry, which is an L3 entry that has the L4 and L3 indices (0b9, 00c) as its tag, because this tag matches the longest prefix
