
Web Search for a Planet: The Google Cluster Architecture
Amenable to extensive parallelization, Google’s Web search application lets different queries run on different processors and, by partitioning the overall index, also lets a single query use multiple processors. To handle this workload, Google’s architecture features clusters of more than 15,000 commodity-class PCs with fault-tolerant software. This architecture achieves superior performance at a fraction of the cost of a system built from fewer, but more expensive, high-end servers.
Luiz André Barroso, Jeffrey Dean, and Urs Hölzle, Google
Few Web services require as much computation per request as search engines. On average, a single query on Google reads hundreds of megabytes of data and consumes tens of billions of CPU cycles. Supporting a peak request stream of thousands of queries per second requires an infrastructure comparable in size to that of the largest supercomputer installations. Combining more than 15,000 commodity-class PCs with fault-tolerant software creates a solution that is more cost-effective than a comparable system built out of a smaller number of high-end servers.


Here we present the architecture of the Google cluster, and discuss the most important factors that influence its design: energy efficiency and price-performance ratio. Energy efficiency is key at our scale of operation, as power consumption and cooling issues become significant operational factors, taxing the limits of available data center power densities.

Our application affords easy parallelization: Different queries can run on different processors, and the overall index is partitioned so that a single query can use multiple processors. Consequently, peak processor performance is less important than its price/performance. As such, Google is an example of a throughput-oriented workload, and should benefit from processor architectures that offer more on-chip parallelism, such as simultaneous multithreading or on-chip multiprocessors.
Google architecture overview
Google’s software architecture arises from two basic insights. First, we provide reliability in software rather than in server-class hardware, so we can use commodity PCs to build a high-end computing cluster at a low-end price. Second, we tailor the design for best aggregate request throughput, not peak server response time, since we can manage response times by parallelizing individual requests.
We believe that the best price/performance tradeoff for our applications comes from fashioning a reliable computing infrastructure from clusters of unreliable commodity PCs. We provide reliability in our environment at the software level, by replicating services across many different machines and automatically detecting and handling failures. This software-based reliability encompasses many different areas and involves all parts of our system design. Examining the control flow in handling a query provides insight into the high-level structure of the query-serving system, as well as insight into reliability considerations.
Serving a Google query
When a user enters a query to Google (such as www.google.com/search?q=ieee+society), the user’s browser first performs a domain name system (DNS) lookup to map www.google.com to a particular IP address. To provide sufficient capacity to handle query traffic, our service consists of multiple clusters distributed worldwide. Each cluster has a few thousand machines, and the geographically distributed setup protects us against catastrophic data center failures (like those arising from earthquakes and large-scale power failures). A DNS-based load-balancing system selects a cluster by accounting for the user’s geographic proximity to each physical cluster. The load-balancing system minimizes round-trip time for the user’s request, while also considering the available capacity at the various clusters.
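To make the selection policy concrete, here is a minimal Python sketch of a chooser that prefers the nearest cluster with spare capacity. The class, the 0.8 load threshold, and the fallback rule are illustrative assumptions, not Google’s actual policy.

    from dataclasses import dataclass

    @dataclass
    class Cluster:
        name: str
        rtt_ms: float   # estimated round-trip time from the user
        load: float     # current utilization, 0.0-1.0

    def pick_cluster(clusters, max_load=0.8):
        """Prefer the geographically nearest cluster that has spare capacity."""
        usable = [c for c in clusters if c.load < max_load]
        if not usable:
            # Every cluster is saturated: fall back to the least-loaded one.
            return min(clusters, key=lambda c: c.load)
        return min(usable, key=lambda c: c.rtt_ms)

    clusters = [Cluster("us-east", rtt_ms=20.0, load=0.9),
                Cluster("eu-west", rtt_ms=95.0, load=0.4)]
    print(pick_cluster(clusters).name)  # eu-west: us-east is closer but over the threshold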
The user’s browser then sends a hypertext transfer protocol (HTTP) request to one of these clusters, and thereafter, the processing of that query is entirely local to that cluster. A hardware-based load balancer in each cluster monitors the available set of Google Web servers (GWSs) and performs local load balancing of requests across a set of them. After receiving a query, a GWS machine coordinates the query execution and formats the results into a Hypertext Markup Language (HTML) response to the user’s browser. Figure 1 illustrates these steps.
[Figure 1. Google query-serving architecture. Components shown: Google Web server, spell checker, index servers, document servers.]

Query execution consists of two major phases.1 In the first phase, the index servers consult an inverted index that maps each query word to a matching list of documents (the hit list). The index servers then determine a set of relevant documents by intersecting the hit lists of the individual query words, and they compute a relevance score for each document. This relevance score determines the order of results on the output page.
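A toy version of this first phase makes the mechanics concrete. The miniature inverted index below and the additive scoring are illustrative stand-ins; the paper does not disclose the real ranking function.

    # Toy inverted index: word -> {docid: per-word score contribution}.
    index = {
        "ieee":    {1: 0.9, 2: 0.4, 5: 0.7},
        "society": {2: 0.6, 3: 0.8, 5: 0.2},
    }

    def search(query_words, index):
        """Intersect the per-word hit lists, then rank the surviving docids."""
        hit_lists = [index.get(w, {}) for w in query_words]
        if not hit_lists:
            return []
        # Keep only documents that match every query word.
        common = set(hit_lists[0]).intersection(*hit_lists[1:])
        # Stand-in relevance score: the sum of the per-word contributions.
        return sorted(common, key=lambda d: -sum(h[d] for h in hit_lists))

    print(search(["ieee", "society"], index))  # [2, 5]: doc 2 scores 1.0, doc 5 scores 0.9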
The search process is challenging because of the large amount of data: The raw documents comprise several tens of terabytes of uncompressed data, and the inverted index resulting from this raw data is itself many terabytes of data. Fortunately, the search is highly parallelizable by dividing the index into pieces (index shards), each having a randomly chosen subset of documents from the full index. A pool of machines serves requests for each shard, and the overall index cluster contains one pool for each shard. Each request chooses a machine within a pool using an intermediate load balancer; in other words, each query goes to one machine (or a subset of machines) assigned to each shard. If a shard’s replica goes down, the load balancer will avoid using it for queries, and other components of our cluster-management system will try to revive it or eventually replace it with another machine. During the downtime, the system capacity is reduced in proportion to the total fraction of capacity that this machine represented. However, service remains uninterrupted, and all parts of the index remain available.
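The pool-plus-balancer arrangement can be sketched as follows; health tracking is reduced to a set of live replicas, and all names are hypothetical.

    import random

    class ShardPool:
        """The replicas serving one index shard, behind a small load balancer."""
        def __init__(self, shard_id, replicas):
            self.shard_id = shard_id
            self.healthy = set(replicas)

        def pick_replica(self):
            if not self.healthy:
                raise RuntimeError(f"shard {self.shard_id}: no replicas left")
            return random.choice(sorted(self.healthy))

        def mark_down(self, replica):
            # Queries simply stop flowing here; the shard loses one replica's
            # worth of capacity but stays available on the others.
            self.healthy.discard(replica)

    pool = ShardPool("shard-07", ["idx-a", "idx-b", "idx-c"])
    pool.mark_down("idx-b")        # the balancer detected a failure
    print(pool.pick_replica())     # idx-a or idx-c, never idx-b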
The final result of this first phase of query execution is an ordered list of document identifiers (docids). As Figure 1 shows, the second phase involves taking this list of docids and computing the actual title and uniform resource locator of these documents, along with a query-specific document summary. Document servers (docservers) handle this job, fetching each document from disk to extract the title and the keyword-in-context snippet. As with the index lookup phase, the strategy is to partition the processing of all documents by
• randomly distributing documents into smaller shards,
• having multiple server replicas responsible for handling each shard, and
• routing requests through a load balancer (a minimal sketch of this routing follows the list).
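In the sketch below, a stable hash stands in for the random document-to-shard assignment described above; the hash choice and the shard count are illustrative assumptions.

    import hashlib

    NUM_DOC_SHARDS = 4  # illustrative; real deployments use far more

    def doc_shard(docid: int) -> int:
        """Stable, roughly uniform assignment of a document to a shard."""
        digest = hashlib.md5(str(docid).encode()).digest()
        return digest[0] % NUM_DOC_SHARDS

    # Every docid always maps to the same shard, so any healthy replica in
    # that shard's pool (see ShardPool above) can serve the title, URL, and
    # keyword-in-context snippet for it.
    for docid in (101, 102, 103, 104):
        print(docid, "-> docshard", doc_shard(docid))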
The docserver cluster must have access to an online, low-latency copy of the entire Web. In fact, because of the replication required for performance and availability, Google stores dozens of copies of the Web across its clusters.
In addition to the indexing and document-serving phases, a GWS also initiates several other ancillary tasks upon receiving a query, such as sending the query to a spell-checking system and to an ad-serving system to generate relevant advertisements (if any). When all phases are complete, a GWS generates the appropriate HTML for the output page and returns it to the user’s browser.
Using replication for capacity and fault-tolerance
We have structured our system so that most accesses to the index and other data structures involved in answering a query are read-only: Updates are relatively infrequent, and we can often perform them safely by diverting queries away from a service replica during an update. This principle sidesteps many of the consistency issues that typically arise in using a general-purpose database.
We also aggressively exploit the very large amounts of inherent parallelism in the application: For example, we transform the lookup of matching documents in a large index into many lookups for matching documents in a set of smaller indices, followed by a relatively inexpensive merging step. Similarly, we divide the query stream into multiple streams, each handled by a cluster. Adding machines to each pool increases serving capacity, and adding shards accommodates index growth. By parallelizing the search over many machines, we reduce the average latency necessary to answer a query, dividing the total computation across more CPUs and disks. Because individual shards don’t need to communicate with each other, the resulting speedup is nearly linear. In other words, the CPU speed of the individual index servers does not directly influence the search’s overall performance, because we can increase the number of shards to accommodate slower CPUs, and vice versa. Consequently, our hardware selection process focuses on machines that offer an excellent request throughput for our application, rather than machines that offer the highest single-thread performance.
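The fan-out-and-merge pattern this paragraph describes might look like the following sketch, with canned per-shard results standing in for real index lookups.

    from concurrent.futures import ThreadPoolExecutor
    import heapq

    # Canned per-shard hit lists of (score, docid) pairs, best first.
    SHARD_RESULTS = [
        [(0.9, 11), (0.5, 12)],
        [(0.8, 21), (0.2, 22)],
        [(0.7, 31)],
    ]

    def search_shard(shard_id, query):
        # Stand-in for the real per-shard lookup; shards never talk to each other.
        return SHARD_RESULTS[shard_id]

    def global_search(query, k=3):
        """Fan the query out to every shard, then cheaply merge the results."""
        n = len(SHARD_RESULTS)
        with ThreadPoolExecutor(max_workers=n) as pool:
            partials = pool.map(search_shard, range(n), [query] * n)
            # Each partial list is sorted by descending score, so a k-way
            # merge on the negated score yields the global order.
            merged = heapq.merge(*partials, key=lambda hit: -hit[0])
            return [docid for _, docid in list(merged)[:k]]

    print(global_search("ieee"))  # [11, 21, 31]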
In summary, Google clusters follow three key design principles:
• Software reliability. We eschew fault-tolerant hardware features such as redundant power supplies, a redundant array of inexpensive disks (RAID), and high-quality components, instead focusing on tolerating failures in software.
• Use replication for better request throughput and availability. Because machines are inherently unreliable, we replicate each of our internal services across many machines. Because we already replicate services across multiple machines to obtain sufficient capacity, this type of fault tolerance almost comes for free.
• Price/performance beats peak performance. We purchase the CPU generation that currently gives the best performance per unit price, not the CPUs that give the best absolute performance. Using commodity PCs reduces the cost of computation. As a result, we can afford to use more computational resources per query, employ more expensive techniques in our ranking algorithm, or search a larger index of documents.
Leveraging commodity parts
Google’s racks consist of 40 to 80 x86-based servers mounted on either side of a custom-made rack (each side of the rack contains twenty 2u or forty 1u servers). Our focus on price/performance favors servers that resemble mid-range desktop PCs in terms of their components, except for the choice of large disk drives. Several CPU generations are in active service, ranging from single-processor 533-MHz Intel-Celeron-based servers to dual 1.4-GHz Intel Pentium III servers. Each server contains one or more integrated drive electronics (IDE) drives, each holding 80 Gbytes. Index servers typically have less disk space than document servers because the former have a more CPU-intensive workload. The servers on each side of a rack interconnect via a 100-Mbps Ethernet switch that has one or two gigabit uplinks to a core gigabit switch that connects all racks together.
Our ultimate selection criterion is cost per query, expressed as the sum of capital expense (with depreciation) and operating costs (hosting, system administration, and repairs) divided by performance. Realistically, a server will not last beyond two or three years, because of its disparity in performance when compared to newer machines. Machines older than three years are so much slower than current-generation machines that it is difficult to achieve proper load distribution and configuration in clusters containing both types. Given the relatively short amortization period, the equipment cost figures prominently in the overall cost equation.
Because Google servers are custom made, we’ll use pricing information for comparable PC-based server racks for illustration. For example, in late 2002 a rack of 88 dual-CPU 2-GHz Intel Xeon servers with 2 Gbytes of RAM and an 80-Gbyte hard disk was offered on RackSaver.com for around $278,000. This figure translates into a monthly capital cost of $7,700 per rack over three years. Personnel and hosting costs are the remaining major contributors to overall cost.
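The $7,700 figure is straight-line depreciation of the rack price over the three-year amortization period stated above:

    rack_price = 278_000           # late-2002 RackSaver quote, in dollars
    service_life_months = 3 * 12   # three-year amortization, per the text

    monthly_capital = rack_price / service_life_months
    print(f"${monthly_capital:,.0f} per rack per month")  # $7,722, i.e. ~$7,700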
The relative importance of equipment cost makes traditional server solutions less appealing for our problem because they increase performance but decrease the price/performance. For example, four-processor motherboards are expensive, and because our application parallelizes very well, such a motherboard doesn’t recoup its additional cost with better performance. Similarly, although SCSI disks are faster and more reliable, they typically cost two or three times as much as an equal-capacity IDE drive.
The cost advantages of using inexpensive, PC-based clusters over high-end multiprocessor servers can be quite substantial, at least for a highly parallelizable application like ours. The example $278,000 rack contains 176 2-GHz Xeon CPUs, 176 Gbytes of RAM, and 7 Tbytes of disk space. In comparison, a typical x86-based server contains eight 2-GHz Xeon CPUs, 64 Gbytes of RAM, and 8 Tbytes of disk space; it costs about $758,000.2 In other words, the multiprocessor server is about three times more expensive but has 22 times fewer CPUs, three times less RAM, and slightly more disk space. Much of the cost difference derives from the much higher interconnect bandwidth and reliability of a high-end server, but again, Google’s highly redundant architecture does not rely on either of these attributes.
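The quoted ratios follow directly from the two configurations:

    rack   = {"price": 278_000, "cpus": 176, "ram_gb": 176, "disk_tb": 7}
    server = {"price": 758_000, "cpus": 8,   "ram_gb": 64,  "disk_tb": 8}

    print(f"{server['price'] / rack['price']:.1f}x the price")  # 2.7x: "about three times"
    print(f"{rack['cpus'] / server['cpus']:.0f}x the CPUs")     # 22x
    print(f"{rack['ram_gb'] / server['ram_gb']:.1f}x the RAM")  # 2.8x: "three times less RAM"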
Operating thousands of mid-range PCs instead of a few high-end multiprocessor servers incurs significant system administration and repair costs. However, for a relatively homogeneous application like Google, where most servers run one of very few applications, these costs are manageable. Assuming tools to install and upgrade software on groups of machines are available, the time and cost to maintain 1,000 servers isn’t much more than the cost of maintaining 100 servers because all machines have identical configurations. Similarly, the cost of monitoring a cluster using a scalable application-monitoring system does not increase greatly with cluster size. Furthermore, we can keep repair costs reasonably low by batching repairs and ensuring that we can easily swap out components with the highest failure rates, such as disks and power supplies.
The power problem
Even without special, high-density packaging, power consumption and cooling issues can become challenging. A mid-range server with dual 1.4-GHz Pentium III processors draws about 90 W of DC power under load: roughly 55 W for the two CPUs, 10 W for a disk drive, and 25 W to power DRAM and the motherboard. With a typical efficiency of about 75 percent for an ATX power supply, this translates into 120 W of AC power per server, or roughly 10 kW per rack. A rack comfortably fits in 25 ft² of space, resulting in a power density of 400 W/ft². With higher-end processors, the power density of a rack can exceed 700 W/ft².
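The arithmetic behind these density figures, assuming a full 80-server rack:

    cpu_w, disk_w, board_w = 55, 10, 25     # per-server DC draw under load
    dc_watts = cpu_w + disk_w + board_w     # 90 W DC
    ac_watts = dc_watts / 0.75              # ~75%-efficient ATX supply -> 120 W AC

    servers_per_rack = 80                   # the dense end of the 40-80 range
    rack_kw = ac_watts * servers_per_rack / 1000
    rack_area_sqft = 25

    print(f"{rack_kw:.1f} kW per rack")                      # 9.6, "roughly 10 kW"
    print(f"{rack_kw * 1000 / rack_area_sqft:.0f} W/sq ft")  # 384, "400 W/sq ft"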
Unfortunately, the typical power density for commercial data centers lies between 70 and 150 W/ft², much lower than that required for PC clusters. As a result, even low-tech PC clusters using relatively straightforward packaging need special cooling or additional space to bring down power density to that which is tolerable in typical data centers. Thus, packing even more servers into a rack could be of limited practical use for large-scale deployment as long as such racks reside in standard data centers. This situation leads to the question of whether it is possible to reduce the power usage per server.
Reduced-power servers are attractive for large-scale clusters, but you must keep some caveats in mind. First, reduced power is desirable, but, for our application, it must come without a corresponding performance penalty: What counts is watts per unit of performance, not watts alone. Second, the lower-power server must not be considerably more expensive, because the cost of depreciation typically outweighs the cost of power. The earlier-mentioned 10-kW rack consumes about 10 MWh of power per month (including cooling overhead). Even at a generous 15 cents per kilowatt-hour (half for the actual power, half to amortize uninterruptible power supply [UPS] and power distribution equipment), power and cooling cost only $1,500 per month. Such a cost is small in comparison to the depreciation cost of $7,700 per month. Thus, low-power servers must not be more expensive than regular servers to have an overall cost advantage in our setup.
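A rough check of the power-versus-depreciation comparison; the cooling-overhead multiplier is an assumption chosen to reproduce the article’s “about 10 MWh” figure:

    rack_kw = 10                 # rounded rack draw from the estimate above
    cooling_overhead = 1.4       # assumed multiplier to match "about 10 MWh"
    usd_per_kwh = 0.15           # half energy, half UPS/distribution amortization

    kwh_per_month = rack_kw * 730 * cooling_overhead   # ~10,200 kWh, ~10 MWh
    power_cost = kwh_per_month * usd_per_kwh
    print(f"power and cooling: ~${power_cost:,.0f}/month")  # ~$1,533 vs. $7,700 depreciation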
Hardware-level application characteristics
Examining various architectural characteristics of our application helps illustrate which hardware platforms will provide the best price/performance for our query-serving system. We’ll concentrate on the characteristics of the index server, the component of our infrastructure whose price/performance most heavily impacts overall price/performance. The main activity in the index server consists of decoding compressed information in the inverted index and finding matches against a set of documents that could satisfy a query. Table 1 shows some basic instruction-level measurements of the index server program running on a 1-GHz dual-processor Pentium III system.
Table 1. Instruction-level measurements on the index server.

    Characteristic                   Value
    Cycles per instruction           1.1
    Ratios (percentage)
      Branch mispredict              5.0
      Level 1 instruction miss*      0.4
      Level 1 data miss*             0.7
      Level 2 miss*                  0.3
      Instruction TLB miss*          0.04
      Data TLB miss*                 0.7
    * Cache and TLB ratios are per instructions retired.
The application has a moderately high CPI, considering that the Pentium III is capable of issuing three instructions per cycle. We expect such behavior, considering that the application traverses dynamic data structures and that control flow is data dependent, creating a significant number of difficult-to-predict branches. In fact, the same workload running on the newer Pentium 4 processor exhibits nearly twice the CPI and approximately the same branch prediction performance, even though the Pentium 4 can issue more instructions concurrently and has superior branch prediction logic. In essence, there isn’t that much exploitable instruction-level parallelism (ILP) in the workload. Our measurements suggest that the level of aggressive out-of-order, speculative execution present in modern processors is already beyond the point of diminishing performance returns for such programs.

A more profitable way to exploit parallelism for applications such as the index server is to leverage the trivially parallelizable computation. Processing each query shares mostly read-only data with the rest of the system, and constitutes a work unit that requires little communication. We already take advantage of that at the cluster level by deploying large numbers of inexpensive nodes, rather than fewer high-end ones. Exploiting such abundant thread-level parallelism at the microarchitecture level appears equally promising. Both simultaneous multithreading (SMT) and chip multiprocessor (CMP) architectures target thread-level parallelism and should improve the performance of many of our servers.
Some early experiments with a dual-context (SMT) Intel Xeon processor show more than a 30 percent performance improvement over a single-context setup. This speedup is at the upper bound of improvements reported by Intel for their SMT implementation.3
We believe that the potential for CMP systems is even greater. CMP designs, such as Hydra4 and Piranha,5 seem especially promising. In these designs, multiple (four to eight) simpler, in-order, short-pipeline cores replace a complex high-performance core.
