
CPU DB: Recording Microprocessor History
With this open database, you can mine microprocessor trends over the past 40 years.
Andrew Danowitz, Kyle Kelley, James Mao, John P. Stevenson, and Mark Horowitz, Stanford University
In November 1971, Intel introduced the world’s first single-chip microprocessor, the Intel 4004.


It had 2,300 transistors, ran at a clock speed of up to 740 kHz, and delivered 60,000 instructions
per second while dissipating 0.5 watts. The following four decades witnessed exponential growth
in compute power, a trend that has enabled applications as diverse as climate modeling, protein folding, and computing real-time ballistic trajectories of angry birds. Today’s microprocessor chips employ billions of transistors, include multiple processor cores on a single silicon die, run at clock speeds measured in gigahertz, and deliver more than 4 million times the performance of the original 4004.
Where did these incredible gains come from? This article sheds some light on this question by introducing CPU DB (cpudb.stanford.edu), an open and extensible database collected by Stanford’s VLSI (very large-scale integration) Research Group over several generations of processors (and students). We gathered information on commercial processors from 17 manufacturers and placed it in CPU DB, which now contains data on 790 processors spanning the past 40 years.
In addition, we provide a methodology to separate the effect of technology scaling from improvements on other frontiers (e.g., architecture and software), allowing the comparison of machines built in different technologies. To demonstrate the utility of this data and analysis, we use it to decompose processor improvements into contributions from the physical scaling of devices, and from improvements in microarchitecture, compiler, and software technologies.
AN OPEN REPOSITORY OF PROCESSOR SPECS
While information about current processors is easy to find, it is rarely arranged in a manner that is useful to the research community. For example, the data sheet may contain the processor’s power, voltage, frequency, and cache size, but not the pipeline depth or the technology minimum feature size. Even then, these specifications often fail to tell the full story: a laptop processor operates over a range of frequencies and voltages, not just the 2 GHz shown on the box label.
Not surprisingly, specification data gets harder to find the older the processor becomes, especially for those that are no longer made, or worse, whose manufacturers no longer exist. We have been collecting this type of data for three decades and are now releasing it in the form of an open repository of processor specifications. The goal of CPU DB is to aggregate detailed processor specifications into a convenient form and to encourage community participation, both to leverage this information and to keep it accurate and current. CPU DB (cpudb.stanford.edu) is populated with desktop, laptop, and server processors, for which we use SPEC13 as our performance-measuring tool. In addition, the database contains limited data on embedded cores, for which we are using the CoreMark benchmark for performance.5 With time and help from the community, we hope to extend the coverage of embedded processors in the database.

TABLE 1: Categories used to organize per-processor specifications in CPU DB.

Processor architecture and microarchitecture
  Summary parameter: Architecture family
  Parameters: Manufacturer, Family name, Code name, Model name, Date released, Number of cores, Threads per core, Word size

Memory system
  Summary parameter: Last level cache
  Parameters: L1 data size, L1 instruction size, L2 size, Memory bandwidth

Physical characteristics
  Summary parameters: Vdd nominal, Clock frequency, TDP
  Parameters: FSB pins, Memory pins, Power and ground pins, I/O pins, Nominal frequency, Turbo frequency, Low power frequency, TDP, Number of transistors

Technology
  Summary parameter: Process size
  Parameters: Process name, Process type, Feature size, Effective channel length, Number of metal layers, Metal type
For users to analyze different processor features, CPU DB contains many data entries for each CPU, ranging from physical parameters such as number of metal layers, to overall performance metrics such as SPEC scores. To make viewing relevant data easier, the database includes summary fields, such as nominal clock frequency, that try to represent more detailed scaling data. Table 1 shows the current list of CPU DB parameters. Table 2 summarizes the “microarchitecture” specifications.
All high-performance processors today tell the system what supply voltage they need within a range of allowable values. This makes it difficult to track how power-supply voltage has scaled over time. Instead of relying on the specified worst-case behavior, researchers are free to analyze the power, frequency, and voltage that a processor actually uses while running an application, and then add it to the CPU DB repository. Table 3 is a summary of the measured parameters tracked in CPU DB.
TABLE 2: Microarchitectural parameters contained in CPU DB.
Manufacturer
Microarchitecture
ISA version
ISA extensions
Floating point pipe stages
Integer pipe stages
Max uOps issued per cycle
Integer functional units
Load store functional units
Floating point functional units
Total functional units
Max instructions decoded per cycle
Reorder buffer
Instruction window size
Instruction fetch queue size
Branch history table
Branch target buffer
Branch predictor accuracy
Integer registers
Floating point registers
Total registers
Floating point coproc.
TLB entries
Out of order
Integrated mem. controller
TABLE 3: Measured parameters in CPU DB.*
Power: Power for specified load, Idle power, Max operating power
Voltage: Vdd for specified load, Vdd idle, Vdd at max power
Performance: SPECRate 2006, SPEC 2006, SPEC 2000, SPEC 1995, SPEC 1992, MIPS
* Note: SPEC benchmarks also include comprehensive fields for performance on individual SPEC subtests.

While CPU DB includes a large set of processor data fields, certain members of the architecture community will likely want to explore data fields that we did not think to include. To handle such situations, users are encouraged to suggest new data columns. These suggestions will be reviewed and then entered in the database.
A similar system helps keep CPU DB accurate and up to date. Users can submit data for new processors and architectures, and suggest corrections to data entries. We understand that users may not have data for all of the specifications, and we encourage users to submit any subsets of the data fields. New data and corrections will be reviewed before being applied to the database.
With these mechanisms for adding and vetting data, CPU DB will be a powerful tool for architects who wish to incorporate processor data into their studies. Because many database users will probably want to perform analyses on the raw CPU DB data, the full database is downloadable in comma-separated value (CSV) format.
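Because the download is plain CSV, a few lines of scripting are enough to start exploring it. The sketch below is illustrative only: the file name and column names (processors.csv, date, clock) are placeholders and should be checked against the actual headers in the downloaded files.

```python
# Minimal sketch: load a CPU DB CSV export and look at a simple trend.
# NOTE: "processors.csv" and the column names used here are assumptions.
import pandas as pd

cpus = pd.read_csv("processors.csv")                     # hypothetical file name
cpus["date"] = pd.to_datetime(cpus["date"], errors="coerce")

# Example query: peak clock frequency among parts released each year.
by_year = (cpus.dropna(subset=["date", "clock"])
               .assign(year=lambda df: df["date"].dt.year)
               .groupby("year")["clock"].max())
print(by_year.tail())
```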
TECHNOLOGY NORMALIZATION METHODOLOGY
CPU DB allows side-by-side access to performance data for relatively simple in-order processors
(up to the mid-1990s) and modern out-of-order processors. One could ask if, at the cost of lower performance, the simplicity of the older designs conferred an efficiency advantage. Unfortunately, direct comparisons using the raw data are difficult because, over the years, manufacturing technologies have improved significantly. A fair comparison would be possible if both processors were manufactured using the same process; but since porting all of these older processors to modern technologies is not feasible, we need another approach. To enable such comparisons, we instead estimate how processor performance and power would scale with technology.
Our main performance metric is based on industry-standard SPEC CPU2006 scores.13 Unfortunately, most older processors were never measured with SPEC 2006; their performance was instead reported in MIPS (million instructions per second) and, later, in terms of SPEC 1989, SPEC 1992, SPEC 1995, and SPEC 2000. In those cases we estimate a SPEC 2006 equivalent score by converting the old score with a conversion factor. The conversion values are determined by examining systems that have scores for two versions of SPEC and then taking the geometric mean of the set of ratios between overlapping scores. This method was used to create the summary performance scores in the database. We also provide the raw scores so that users can develop better conversion methods over time.
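The conversion step can be sketched as follows. The overlap scores below are made-up placeholders, but the geometric-mean-of-ratios calculation mirrors the method just described.

```python
import math

# Hypothetical overlap set: (SPEC 2000 score, SPEC 2006 score) for machines
# that were measured under both suites. These numbers are placeholders.
overlap = [(1430.0, 28.1), (1210.0, 23.5), (980.0, 19.8)]

# Conversion factor = geometric mean of the per-machine score ratios.
ratios = [new / old for old, new in overlap]
factor = math.exp(sum(math.log(r) for r in ratios) / len(ratios))

def spec2000_to_spec2006(score_2000: float) -> float:
    """Estimate a SPEC 2006 equivalent from a SPEC 2000 result."""
    return score_2000 * factor
```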
To estimate the performance of a processor if it were manufactured using a newer process, we calculate the clock frequency in that technology using gate-delay data. While the speed of the cache memory on the processor scales with technology, the delay going to main memory has scaled only slowly with time. As a result, doubling the clock frequency generally does not double the processor's performance. We finesse this issue the same way the microprocessor industry does: by scaling the on-chip cache so the percentage of memory stall time remains constant. Using the empirical rule that miss rates are inversely proportional to the square root of the cache size,9,14 we expand the last-level cache by four times for each doubling of clock frequency. Thus, we assume that the processor performance scales with clock frequency, but we penalize the energy and area of the processor by growing its cache.
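As a concrete sketch of this rule: if the miss rate falls as the inverse square root of capacity, keeping the memory-stall fraction constant under a clock-frequency ratio r requires growing the last-level cache by roughly r². The helper below is illustrative only.

```python
def scaled_llc_size(llc_bytes: float, f_new: float, f_old: float) -> float:
    """Grow the last-level cache so the memory-stall fraction stays constant.

    Assumes miss rate ~ 1/sqrt(capacity), so a frequency ratio r calls for
    roughly an r**2 growth in capacity (4x cache per clock doubling).
    """
    r = f_new / f_old
    return llc_bytes * r ** 2

# Doubling the clock (r = 2) quadruples the last-level cache: 2 MiB -> 8 MiB.
print(scaled_llc_size(2 * 2**20, f_new=2.0e9, f_old=1.0e9))
```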
For the clock-cycle time estimate, we need to know how the delays of the gates and wires will scale. Fortunately, the delay scaling of different logic gates is similar, so it is sufficient to measure
how the delay of a single gate scales. Our analysis uses the delay of an inverter driving four equivalent inverters (a fanout of four, or FO4) as the gate-speed metric. Inverters are the most common gate type, and their delay is often published in technology papers. For wire delay it is important to remember that a design’s area will shrink with scaling, so its wire delay will, in general, reduce slowly or, at worst, stay constant. Its effect on cycle time depends on the internal circuit design. Designers generally pipeline long wires, so they tend not to limit the critical path. Thus, we ignore wire delay and make the slightly optimistic assumption that a processor’s frequency in the new technology will be greater by the ratio of FO4s from old to new:
\[ f_2 = f_1 \cdot \frac{\mathrm{FO4}_1}{\mathrm{FO4}_2} \]
Using FO4 as a basic metric has an additional advantage: it cleanly covers the performance/energy variation that comes from changing the supply voltage. Two processors, even built in the same technology, might be operated at different supply voltages. The energy difference between the two can be calculated directly from the supply voltage, but the voltage’s effect on performance is harder to estimate. Using FO4 data for these designs at two different voltages provides all the information that is needed.
Having accounted for the effect of the scaled memory systems, we find that estimating the power of a processor with scaled technology is fairly straightforward. Processor power has two components: dynamic and leakage. In an optimized design, the leakage power is around 30 percent of the dynamic power, and the leakage power will scale as the dynamic power scales.16
Dynamic power is given by the product of the processor's average activity factor, α (the probability that a node will switch each cycle), the processor frequency, and the energy required to switch the transistors:
\[ \mathrm{Energy} = C \cdot V_{dd}^2 \]
The processor's average activity factor depends on the logic and not the technology, so it is constant with scaling. Since capacitance per unit length is roughly constant with scaling, C should be proportional to the feature size λ. We have already estimated how the frequency will scale, so the estimated power and performance scaling with technology is:
\[ P_2 = P_1 \cdot \frac{\lambda_2 V_2^2\, \mathrm{FO4}_1}{\lambda_1 V_1^2\, \mathrm{FO4}_2} + P_{cache} \]
\[ \mathrm{Perf}_2 = \mathrm{Perf}_1 \cdot \frac{f_2}{f_1} = \mathrm{Perf}_1 \cdot \frac{\mathrm{FO4}_1}{\mathrm{FO4}_2} \]
For analyzing processor efficiency, it is often better to look at energy per operation rather than power. Energy/op factors out the linear relationship that both performance and power have with frequency (FO4). Lowering the frequency changes the power but does not change the energy/op. Since energy/op is proportional to the ratio of power over performance, we derive the next equation by dividing the previous two:
\[ \mathrm{energy}_2 \propto \frac{P_2}{\mathrm{Perf}_2} = \frac{1}{\mathrm{Perf}_1}\left( P_1 \cdot \frac{\lambda_2 V_2^2}{\lambda_1 V_1^2} + P_{cache} \cdot \frac{\mathrm{FO4}_2}{\mathrm{FO4}_1} \right) \]
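The scaling relations above can be collected into one small routine. This is a minimal sketch of the normalization math, not a drop-in tool: the function and parameter names are ours for illustration, and the user must supply FO4 delays, supply voltages, feature sizes (λ), and the added cache power for the target process.

```python
def normalize(perf1, p1, fo4_1, fo4_2, v1, v2, lam1, lam2, p_cache=0.0):
    """Estimate performance, power, and energy/op after a technology move.

    perf1, p1   : measured performance and power in the original process
    fo4_i       : FO4 inverter delay in process i (smaller = faster)
    v_i, lam_i  : supply voltage and feature size in process i
    p_cache     : extra power for the cache grown to hide memory latency
    """
    freq_ratio = fo4_1 / fo4_2                 # f2 / f1
    perf2 = perf1 * freq_ratio                 # Perf2 = Perf1 * FO4_1 / FO4_2
    p2 = p1 * (lam2 * v2**2 * fo4_1) / (lam1 * v1**2 * fo4_2) + p_cache
    energy_per_op = p2 / perf2                 # proportional to energy/op
    return perf2, p2, energy_per_op
```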
With these expressions, it is possible to normalize CPU DB processors' performance and energy into a single process technology. While earlier work from Intel gave a rough sketch of how technology scaling and architectural improvement contributed to processor performance over the years,2 our data and normalization method can be used to generate an actual scatter plot showing the breakdown between the two factors: faster transistors (resulting from technology scaling) and architectural improvement. As seen in figure 1, process scaling and microarchitectural scaling each contribute nearly the same amount to processor performance gains.
As a quick sanity check for our normalization results, we plot normalized performance versus transistor count and normalized area in figures 2 and 3. These plots look at Pollack’s rule, which states that performance scales as the square root of design complexity.1 Pollack’s rule has been used in numerous published studies to compare performance against processor die resource usage.2,4,10,15
[Figure 1: Processor performance improvements over time. X axis: feature size (μm); Y axes: performance / performance of 386 and FO4 of 386 / FO4. All processors are normalized to the performance of the Intel 386. The squares indicate how processor performance actually scaled with time, while the diamonds denote how much speedup came from improving the manufacturing process.]

[Figure 2: Pollack's rule using CPU DB: performance vs. transistor count. X axis: transistor count (relative to the 386); Y axis: normalized performance (relative to the performance of the 386). The regression yields Perf_norm ∝ n_trans^0.37.]

[Figure 3: Pollack's rule using CPU DB: performance vs. normalized area. X axis: normalized core area (relative to the area of the 386); Y axis: normalized performance (relative to the performance of the 386).]

Figures 2 and 3 show that our normalized data is in close agreement with Pollack’s rule, suggesting that our normalization method accurately represents design performance.
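The Pollack's-rule comparison amounts to fitting a power law, i.e., a straight line in log-log space. A minimal sketch, assuming normalized performance and relative transistor counts have already been extracted from the database (the arrays below are placeholders, not CPU DB values):

```python
import numpy as np

# Placeholder inputs: performance and transistor count, both relative to the 386.
perf_norm = np.array([1.0, 3.2, 9.5, 31.0, 96.0])
n_trans   = np.array([1.0, 11.0, 120.0, 1500.0, 16000.0])

# Fit perf_norm ~ n_trans**k by linear regression in log-log space.
k, log_c = np.polyfit(np.log(n_trans), np.log(perf_norm), 1)
print(f"fitted exponent k = {k:.2f}")   # Pollack's rule predicts k near 0.5
```

Running the same fit on the real normalized data is what produces the exponent annotated in figure 2.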
PHYSICAL SCALING
One of the nice side benefits of collecting this database is that it allows one to see how chip complexity, voltage, and power have scaled over time, and how well scaling predictions compare with reality. The rate of feature scaling has accelerated in recent years (figure 4). Up through the 130 nm (nanometer) process generation, feature size scaled down by a factor of roughly 0.7 every two to three years. Since the 90 nm generation, however, a new process has been introduced roughly every two years. Intel appears to be driving this intense schedule and has been one of the first to market for each process since the 180 nm generation.
As a result of this exponential scaling, in the 25 years since the release of the Intel 80386, transistor area has shrunk by a factor of almost 4,000. If feature size scaling were all that were driving processor density, then transistor counts would have scaled by the same rate. An analysis of commercial microprocessors, however, shows that transistor count has actually grown by a factor of 16,000.
[Figure 4: Scaling of transistor feature sizes over time, from 0.68 μm down to 32 nm. Up to the 130 nm node, feature size scaled every two to three years; since the 90 nm generation, feature-size scaling has accelerated to every two years.]

One simple reason why transistor growth has outpaced feature size is that processor dies have grown. While the 80386 microprocessor had a die size of 103 mm2, modern Intel Core i7 dies have an area of up to 296 mm2. This is not the whole story behind transistor scaling, however. Figure 5 shows technology-independent transistor density by plotting how many square minimum features an average processor transistor occupies. We generated this data by taking the die area, dividing by the feature size squared, and then dividing by the number of transistors. From 1985
to 2005, increasing metal layers and larger cache structures (with their high transistor densities) decreased the average size of a transistor by four times. Interestingly, since 2005, transistor density has actually dropped by roughly a factor of two. While our data does not indicate a reason for this change, we suspect it results from a combination of stricter design rules for sub-wavelength lithography, the use of more robust logic styles in the processor, and a shrinking percentage of the processor area used for cache in chip multiprocessors.
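The density metric in figure 5 is straightforward to recompute from the raw fields. A minimal sketch, assuming die area in mm², feature size in nm, and a transistor count are available for each part (the example values are illustrative, not CPU DB entries):

```python
def square_features_per_transistor(die_area_mm2: float,
                                   feature_size_nm: float,
                                   transistors: float) -> float:
    """Average transistor footprint measured in squared minimum features."""
    feature_size_mm = feature_size_nm * 1e-6           # nm -> mm
    die_area_in_features = die_area_mm2 / feature_size_mm**2
    return die_area_in_features / transistors

# Illustrative values only: a 100 mm2 die, 1 um process, 300,000 transistors.
print(square_features_per_transistor(100.0, 1000.0, 300_000.0))
```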
[Figure 5: Number of square features per transistor over time. Features per transistor fell until about 2004, indicating a growth in technology-independent transistor density; in modern chips, transistors have started to grow.]

Our data also provides some interesting insight into how supply voltages have scaled over time. Most people know that voltage scales with technology feature size, so many assume this scaling is proportional to feature size, as originally proposed in Dennard's 1974 article.6 As he and others have noted, however, and as shown in figure 6, voltage has not scaled at the same pace as feature size.3,12 Until roughly the 0.6 μm node, processors maintained an operating voltage of 5 volts, since that was the common supply voltage for popular logic families of the day, and processor power dissipation was not an issue. It was not until manufacturers went to 3.3 volts in the 0.6 μm generation that voltage began to scale with feature size. Fitting a curve to the voltage data from the
half-micron to the 0.13 μm process generations, our data indicates that, even when voltage scaled, it did so with roughly the square root of feature size. This slower scaling has been attributed to reaping a dual benefit of faster gates and better immunity to noise and process variations at the cost of higher chip-power density.
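As an illustrative arithmetic check of this square-root fit (using the 3.3 V supply at the 0.6 μm node cited above, not a measured value for any particular part), the predicted supply at 0.13 μm would be

\[ V_{0.13\,\mu\mathrm{m}} \approx 3.3\,\mathrm{V} \times \sqrt{\tfrac{0.13}{0.6}} \approx 1.5\,\mathrm{V}, \]

which is in the neighborhood of the nominal supplies typical of that generation.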
From the 0.13 μm generation on, voltage scaling seems to have slowed. At the same time, however, trends in voltage have become much harder to estimate from our data. As mentioned earlier, today almost all processors define their own operating voltage. The data sheets have only the operating range. Figure 6 plots the maximum specified voltage. More user data should provide insight on how supply voltages are really scaling.
CIRCUITS AND PIPELINING
Circuit designers and microarchitects were not content to scale frequency with gate speed; if they had been, microprocessors would be running at only around 500 MHz today. As figure 7 shows, frequencies scaled much faster than simple gate speed. This discrepancy is largely the result of architectural decisions that decreased the logic depth in each processor pipeline stage and increased the number of stages. From 1985 to around 2000, frequency rapidly increased as a result of faster, more parallel circuit implementations of adders, branch units, and caches, and the use of aggressive pipelining. These trends are evident in the contrast between the two-stage fetch/execute pipeline of the Intel 80386, and the 30-plus pi
