CO101
Principle of Computer
Organization
Lecture 02: Performance
Liang Yanyan
澳門科技大學
Macau University of Science and
Technology
Why Do We Need to Measure Performance?
• Performance is an important attribute when choosing
computers.
• Salespeople may present a computer in its best light.
However, does that “best light” accurately reflect its
performance?
• Understanding how best to measure performance, and
the limitations of a particular performance measure, is
critical in choosing computers.
• This lecture covers different metrics for measuring performance.
Two notions of “performance”
Plane        Speed      DC to Paris   Passengers   Throughput (pmph)
Boeing 747   610 mph    6.5 hours     470          286,700
Concorde     1350 mph   3 hours       132          178,200
pmph: passenger-miles per hour
Which has higher performance?
Example
• Flying time of Concorde vs. Boeing 747?
• Concorde is 1350 mph / 610 mph ≈ 6.5 hours / 3 hours
≈ 2.2 “times faster”
• pmph of Concorde vs. Boeing 747?
• Boeing is 286,700 pmph / 178,200 pmph ≈ 1.6 “times faster”
• Boeing is 1.6 times faster in terms of pmph
• Concorde is 2.2 times faster in terms of flying time
• A problem: which plane is better?
• We need an agreed performance measure before we can say
that one option performs better than another.
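The two comparisons can be reproduced with a few lines of Python. This is a minimal sketch: only the figures come from the table, while the dictionary layout and the names `planes` and `pmph` are illustrative assumptions.

```python
# Figures from the plane table; names and structure are illustrative only.
planes = {
    "Boeing 747": {"speed_mph": 610, "trip_hours": 6.5, "passengers": 470},
    "Concorde": {"speed_mph": 1350, "trip_hours": 3.0, "passengers": 132},
}

def pmph(plane):
    """Throughput in passenger-miles per hour: passengers x speed."""
    return plane["passengers"] * plane["speed_mph"]

boeing, concorde = planes["Boeing 747"], planes["Concorde"]

# Flying time: a shorter trip means higher performance.
time_ratio = boeing["trip_hours"] / concorde["trip_hours"]
# Throughput: more passenger-miles per hour means higher performance.
throughput_ratio = pmph(boeing) / pmph(concorde)

print(f"Concorde is {time_ratio:.1f}x faster by flying time")
print(f"Boeing 747 has {throughput_ratio:.1f}x the throughput")
```

Which plane "wins" depends entirely on which of the two ratios is taken as the metric.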
Computer Performance Metrics
• Purchasing perspective
• Given a collection of machines, which has the
• best performance?
• least cost?
• best ratio of cost-performance?
• Design perspective
• Faced with design options, which has the
• best performance improvement?
• least cost?
• best ratio of cost-performance?
• Both require
• basis for comparison
• metric for evaluation
• Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors
Computer Performance
• Execution time (response time)
• The time between the start and completion of a task (program).
• Throughput
• The total amount of work done in a given time.
• The number of tasks finished in a time period.
• If we upgrade a machine with a faster processor, what
do we increase?
• If we add a new machine to the lab, what do we increase?
Computer Performance
• Individual computer user: interested in execution time.
• Cares about: how long does my job take to execute?
• Data center manager: interested in throughput.
• Cares about: how many tasks can be done in a time period?
• We will need different performance metrics as well as a
different set of applications to benchmark embedded and
desktop computers, which are more focused on
execution time, versus servers, which are more focused
on throughput.
Defining (Speed) Performance
• In discussing the (speed) performance of computers, we
are primarily concerned with execution time. To
maximize performance, we need to minimize execution time.
• For some program running on computer X,
$$\text{performance}_X = \frac{1}{\text{execution time}_X}$$
• Relative performance: if X is n times faster than Y, then
$$\frac{\text{performance}_X}{\text{performance}_Y} = \frac{\text{execution time}_Y}{\text{execution time}_X} = n$$
Relative Performance Example
• If computer A runs a program in 10 seconds and
computer B runs the same program in 15 seconds, how
much faster is A than B?
• We know that A is n times faster than B if
$$\frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution time}_B}{\text{execution time}_A} = n$$
• The performance ratio is 15/10 = 1.5.
• So A is 1.5 times faster than B.
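The same calculation as a small Python sketch. The function name `relative_performance` and its parameter names are hypothetical, chosen only for illustration.

```python
def relative_performance(time_x, time_y):
    """Return how many times faster X is than Y.

    Performance is 1 / execution time, so the ratio of performances
    equals the inverse ratio of execution times (both in seconds).
    """
    return time_y / time_x

# Computer A takes 10 s, computer B takes 15 s for the same program.
n = relative_performance(time_x=10.0, time_y=15.0)
print(f"A is {n} times faster than B")
```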
Performance Factors
• CPU execution time (CPU time): time the CPU spends
working on a task.
• Does not include time waiting for I/O or running other programs.
$$\text{CPU execution time for a program} = \text{CPU clock cycles for a program} \times \text{clock cycle time} = \frac{\text{CPU clock cycles for a program}}{\text{clock rate}}$$
• We can improve performance by
• reducing the number of clock cycles required for a program;
• reducing the clock cycle time, i.e., increasing the clock rate.
CPU Clocking
• Operation of digital hardware is governed by a
constant-rate clock.
• Clock rate (clock cycles per second, in MHz or GHz) is the
inverse of clock cycle time (clock period).
• clock rate = 1 / clock cycle time
[Figure: clock timing diagram with labels: clock (cycles), data transfer and computation, update state, clock period]
CPU Clocking
• cycle time (clock period): duration of a clock cycle
• e.g., 250 ps = 0.25 ns = $250 \times 10^{-12}$ s
• clock rate (clock frequency): cycles per second
• e.g., 4.0 GHz = 4000 MHz = $4.0 \times 10^{9}$ Hz
10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec ($10^{-9}$ s) clock cycle => 1 GHz ($10^{9}$ Hz) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate
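The conversions in this list all follow from clock rate = 1 / clock period. A minimal Python sketch, where the helper names `clock_rate_hz` and `clock_period_s` are assumptions made for illustration:

```python
def clock_rate_hz(period_s):
    """Clock rate in Hz for a clock period given in seconds."""
    return 1.0 / period_s

def clock_period_s(rate_hz):
    """Clock period in seconds for a clock rate given in Hz."""
    return 1.0 / rate_hz

# Clock periods from the list above, in picoseconds.
for period_ps in (10_000, 5_000, 2_000, 1_000, 500, 250, 200):
    rate = clock_rate_hz(period_ps * 1e-12)
    print(f"{period_ps:>6} ps clock cycle => {rate / 1e9:g} GHz clock rate")
```

Because the results are floating-point values, they should be compared with a tolerance rather than tested for exact equality.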
Improving Performance Example
• A program runs on computer A with a 2 GHz clock in 10
seconds. We are trying to help a computer designer
build a new machine B, that will run this program in 6
seconds. The designer can use new (or perhaps more
expensive) technology to substantially increase the clock
rate, but has informed us that this increase will affect the
rest of the CPU design, causing machine B to require 1.2
times as many clock cycles as machine A for the same
program. What clock rate should we tell the designer to
target for machine B?
Improving Performance Example
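$$\text{CPU time}_A = \frac{\text{CPU clock cycles}_A}{\text{clock rate}_A}$$
$$\text{CPU clock cycles}_A = \text{CPU time}_A \times \text{clock rate}_A = 10\ \text{s} \times 2 \times 10^{9}\ \text{cycles/s} = 20 \times 10^{9}\ \text{cycles}$$
$$\text{CPU time}_B = \frac{\text{CPU clock cycles}_B}{\text{clock rate}_B}$$
$$\text{clock rate}_B = \frac{\text{CPU clock cycles}_B}{\text{CPU time}_B} = \frac{1.2 \times \text{CPU clock cycles}_A}{\text{CPU time}_B} = \frac{1.2 \times 20 \times 10^{9}\ \text{cycles}}{6\ \text{s}} = 4 \times 10^{9}\ \text{cycles/s} = 4\ \text{GHz}$$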
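The derivation can be checked numerically. The sketch below uses a hypothetical helper, `required_clock_rate`, whose name and parameters are not from the lecture; it simply chains the two steps of the derivation.

```python
def required_clock_rate(time_a_s, rate_a_hz, cycle_factor, target_time_b_s):
    """Clock rate machine B needs to hit its target execution time.

    cycle_factor is how many times more clock cycles B needs than A
    for the same program (1.2 in this example).
    """
    cycles_a = time_a_s * rate_a_hz      # cycles = CPU time x clock rate
    cycles_b = cycle_factor * cycles_a   # B needs more cycles than A
    return cycles_b / target_time_b_s    # clock rate = cycles / CPU time

rate_b = required_clock_rate(time_a_s=10.0, rate_a_hz=2e9,
                             cycle_factor=1.2, target_time_b_s=6.0)
print(f"Target clock rate for B: {rate_b / 1e9:.1f} GHz")
```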
Clock Cycles per Instruction
• Not all instructions take the same amount of time to
execute. Different numbers of cycles for different
instructions.
• Multiplication takes more time than addition.
• Floating point operations take longer than integer ones.
• Accessing memory takes more time than accessing registers.
• One way to think about execution time is that it equals
the number of instructions executed multiplied by the
average time per instruction.
$$\text{CPU clock cycles for a program} = \text{instructions for a program} \times \text{average clock cycles per instruction}$$
Clock Cycles per Instruction
• Clock cycles per instruction (CPI) – the average number of
clock cycles each instruction takes to execute.
• A way to compare two different implementations of the same ISA.
• Computing the overall effective CPI is done by looking at the
different types of instructions and their individual cycle counts
and averaging.
• If a program contains n types of instructions, where $\text{CPI}_i$ is
the CPI of type i and $\text{IC}_i$ is the fraction of executed
instructions that are of type i, the overall effective CPI is
$$\text{overall effective CPI} = \sum_{i=1}^{n} \text{CPI}_i \times \text{IC}_i$$
• The overall effective CPI varies by instruction mix – a
measure of the dynamic frequency of instructions across one
or many programs.
Calculate Performance (example)
A program contains the following instructions:
Instruction   Percentage   Cycles per instruction   Weighted CPI
ALU           50%          1                        0.5
Load          20%          5                        1.0
Store         10%          3                        0.3
Branch        20%          2                        0.4
Overall effective CPI                               2.2
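The weighted sum in this table can be sketched in Python as follows. The list layout and the function name `effective_cpi` are illustrative assumptions; the fractions and cycle counts are the ones in the table.

```python
# (instruction type, fraction of executed instructions, cycles per instruction)
mix = [
    ("ALU", 0.50, 1),
    ("Load", 0.20, 5),
    ("Store", 0.10, 3),
    ("Branch", 0.20, 2),
]

def effective_cpi(instruction_mix):
    """Overall effective CPI: sum of CPI_i x IC_i over all instruction types."""
    return sum(fraction * cpi for _, fraction, cpi in instruction_mix)

# Print each weighted contribution, then the total.
for name, fraction, cpi in mix:
    print(f"{name:<7} {fraction * cpi:.1f}")
print(f"Overall effective CPI: {effective_cpi(mix):.1f}")
```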
Calculate Performance (example 2)
Suppose we have two implementations of the same instruction set
architecture (two machines using the same instruction set architecture).
For some program,
Machine A has a clock cycle time of 250 ps and a CPI of 2.0,
Machine B has a clock cycle time of 500 ps and a CPI of 1.2.
Which machine runs this program faster, and by how much?
Answer:
Each computer executes the same number of instructions, I, so
Machine A:
$$\text{execution time}_A = I \times \text{CPI}_A \times \text{clock cycle time}_A = I \times 2.0 \times 250\ \text{ps} = I \times 500\ \text{ps}$$
Machine B:
$$\text{execution time}_B = I \times \text{CPI}_B \times \text{clock cycle time}_B = I \times 1.2 \times 500\ \text{ps} = I \times 600\ \text{ps}$$
Clearly, A is faster, by the ratio of execution times:
$$\frac{\text{performance}_A}{\text{performance}_B} = \frac{\text{execution time}_B}{\text{execution time}_A} = \frac{I \times 600\ \text{ps}}{I \times 500\ \text{ps}} = 1.2$$
The Performance Equation
• Our basic performance equation is then
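$$\begin{aligned}
\text{CPU execution time for a program} &= \text{CPU clock cycles for a program} \times \text{clock cycle time} \\
&= \text{instruction count} \times \text{CPI} \times \text{clock cycle time} \\
&= \frac{\text{instruction count} \times \text{CPI}}{\text{clock rate}}
\end{aligned}$$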
• These equations separate the three key factors that
affect performance
• We can measure the CPU execution time by running the program.
• The clock rate is usually given.
• We can measure the instruction count by using
profilers/simulators without knowing all of the implementation
details.
• CPI varies by instruction type and by ISA implementation, so
we must know the implementation details to determine it.
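As an illustration, the performance equation can be applied to the two machines from example 2. This is a sketch under stated assumptions: the function name `cpu_time` and the instruction count of one billion are hypothetical, and the instruction count cancels out of the final ratio.

```python
def cpu_time(instruction_count, cpi, clock_cycle_time_s):
    """CPU time in seconds = instruction count x CPI x clock cycle time."""
    return instruction_count * cpi * clock_cycle_time_s

# Hypothetical instruction count; it cancels out in the ratio below.
instruction_count = 1_000_000_000

time_a = cpu_time(instruction_count, cpi=2.0, clock_cycle_time_s=250e-12)
time_b = cpu_time(instruction_count, cpi=1.2, clock_cycle_time_s=500e-12)

print(f"A: {time_a:.3f} s, B: {time_b:.3f} s")
print(f"A is {time_b / time_a:.2f} times faster than B")
```

Machine B has the lower CPI, yet machine A finishes first, because the equation multiplies CPI by the clock cycle time and A's cycle is half as long.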
Determinants of CPU Performance
                       Instruction count   CPI   Clock rate
Programming language   x                   x
Compiler               x                   x
ISA                    x                   x     x
Organization                               x     x
Technology                                       x
Determinants of CPU Performance
• Language: C vs. Java
• The same algorithm written in different languages yields
different source programs, different assembly programs, and
different machine instructions.
• So the programming language affects instruction count and
CPI.
[Figure: one problem implemented as a C program and as a Java program, yielding assembly program A and assembly program B]
Determinants of CPU Performance
• Compiler: Visual Studio C++ vs. GCC
• The same program compiled by different compilers yields
different assembly programs and different machine
instructions.
• So the compiler affects instruction count and CPI.
[Figure: one algorithm written as a C program, compiled by compiler A and compiler B into assembly program A and assembly program B]
Determinants of CPU Performance
• ISA: Intel vs. Mac
• The same program, built with the same compiler but for machines
with different ISAs, yields different assembly programs and
different machine instructions.
• Different ISAs have different implementations (CPUs).
• So the ISA affects instruction count, CPI, and clock rate.
[Figure: one problem written as a C program and compiled for ISA A and ISA B, yielding a separate assembly program for each ISA]
Determinants of CPU Performance
• CPU: Intel vs. AMD
• A compiled program running on machines with the same ISA but
different CPUs has the same assembly program and the same
machine instructions. However, the instructions will not have
the same CPI on the different CPUs.
• So the CPU (organization) affects CPI and clock rate.
[Figure: problem, C program, compiler, assembly program, and ISA, followed by two implementations, CPU A and CPU B]
Determinants of CPU Performance
• Technology: Intel Core i7 (22 nm) vs. Intel Core i7 (14 nm)
• A compiled program running on machines with the same type of
CPU, manufactured with different technologies, has the same
assembly program and the same machine instructions, and the
instructions have the same CPI on both.
• So technology affects clock rate.
[Figure: problem, C program, compiler, assembly program, ISA, and CPU, followed by two manufacturing options, material A and material B]
Summary
• For a given architecture, performance increases come
from:
• better materials (technology) that increase the clock rate
without increasing the CPI;
• improvements in processor organization that reduce the CPI;
• compiler enhancements that reduce the CPI and/or instruction
count.
Amdahl’s Law
• A common mistake: improving one aspect of a computer and
expecting a proportional improvement in overall performance.
• Example:
“Suppose a program runs in 100 seconds on a machine, with
multiply responsible for 80 seconds of this time. How much do we
have to improve the speed of multiplication if we want the program
to run 4 times faster?”
Exe. time affected = 80 seconds
Exe. time unaffected = 100 – 80 = 20 seconds
4 times faster means 100 / 4 = 25 seconds, so:
25 = 80 / Improvement + 20 → 80 / Improvement = 5 → Improvement = 16 times
$$T_{\text{improved}} = \frac{T_{\text{affected}}}{\text{improvement factor}} + T_{\text{unaffected}}$$
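The formula can be turned around to solve for the improvement factor, as in the multiplication example. A minimal Python sketch; the function names `improved_time` and `required_improvement` are hypothetical.

```python
def improved_time(t_affected, t_unaffected, improvement):
    """Amdahl's Law: execution time after speeding up only the affected part."""
    return t_affected / improvement + t_unaffected

def required_improvement(t_affected, t_unaffected, target_time):
    """Improvement factor the affected part needs to reach target_time."""
    remaining = target_time - t_unaffected
    if remaining <= 0:
        # The unaffected part alone already takes at least the target time.
        raise ValueError("target time is unreachable by improving this part")
    return t_affected / remaining

# Multiplication takes 80 s of a 100 s program; the goal is 100 / 4 = 25 s.
factor = required_improvement(t_affected=80.0, t_unaffected=20.0,
                              target_time=25.0)
print(f"Multiplication must become {factor:.0f} times faster")
```

With the same numbers, a target of 20 seconds (5 times faster overall) raises the error, because the unaffected 20 seconds already use up the whole target.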
Workloads and Benchmarks
• Performance best determined by running a real application
• Use programs typical of expected workload
• Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.
• Benchmarks – a set of programs that form a “workload”
specifically chosen to measure performance.
• SPEC (System Performance Evaluation Cooperative)
• Companies have agreed on a set of real programs and inputs.
• SPEC creates standard sets of benchmarks starting with SPEC89.
The latest is SPEC CPU2006, which consists of 12 integer
benchmarks (CINT2006) and 17 floating-point benchmarks
(CFP2006); see www.spec.org.
• There are also benchmark collections for power workloads
(SPECpower_ssj2008), for mail workloads (SPECmail2008), for
multimedia workloads (MediaBench), and so on.
SPEC Benchmarks
Comparing and Summarizing Performance
• How do we summarize the performance for benchmark
set with a single number?
• First, the execution times are normalized to give the “SPEC ratio”
(bigger is faster, i.e., the SPEC ratio is inversely proportional to
execution time).
• The SPEC ratios are then “averaged” using the geometric mean
(GM).
• Guiding principle in reporting performance
measurements is reproducibility – list everything another
experimenter would need to duplicate the experiment
(version of the operating system, compiler settings, input
set used, specific computer configuration (clock rate,
cache sizes and speed, memory size and speed, etc.))
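The summarizing step can be sketched in Python. The sketch assumes the usual definition of the SPEC ratio as reference time divided by measured time; the timings below are hypothetical values chosen for illustration, not real SPEC results, and the function names are assumptions.

```python
import math

def spec_ratio(reference_time_s, measured_time_s):
    """Normalized score: reference time / measured time (bigger is faster)."""
    return reference_time_s / measured_time_s

def geometric_mean(values):
    """n-th root of the product of n values, computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (reference time, measured time) pairs in seconds.
runs = [(1000.0, 500.0), (1200.0, 150.0)]
ratios = [spec_ratio(ref, measured) for ref, measured in runs]

print(f"SPEC ratios: {ratios}")
print(f"Geometric mean: {geometric_mean(ratios):.2f}")
```

One property of the geometric mean is that the ratio of two machines' summary scores does not depend on which machine supplied the reference times.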
SPEC CINT2006 on AMD Barcelona
Uniprocessor Performance
Constrained by power, instruction-level parallelism, memory latency