程序代写代做代考 compiler algorithm cache Java assembly c++ PowerPoint 演示文稿

PowerPoint 演示文稿

CO101
Principle of Computer

Organization
Lecture 02: Performance

Liang Yanyan

澳門科技大學
Macau of University of Science and

Technology

Why We Need to Measure Performance?

• Performance is an important attribute when choosing
computers.

• Salespeople may show you the best light of a computer.
However, is the “best light” accurately reflect the
performance?

• Understanding how best to measure performance and
the limitations of a particular performance measure is
critical in choosing computers.

• This lecture → different metrics of performance measure.

2

Two notions of “performance”

3

Plane

Boeing 747

Concorde

Speed
(mph)

610 mph

1350 mph

DC to Paris

6.5 hours

3 hours

Passengers

470

132

Throughput
(pmph)

286,700

178,200

Which has higher performance?

pmph: passengers miles per hour

Example
• Time of Concorde vs. Boeing 747?

• Concord is 1350 mph / 610 mph = 6.5 hours / 3 hours
= 2.2 “times faster”

• pmph of Concorde vs. Boeing 747 ?
• Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster”

• Boeing is 1.6 times faster in terms of pmph
• Concord is 2.2 times faster in terms of flying time

• A problem: which plane is better?
• Need a performance measure when we say someone’s

performance is better.
4

Computer Performance Metrics
• Purchasing perspective

• given a collection of machines, which has the
• best performance?
• least cost?
• best ratio of cost-performance?

• Design perspective
• Faced with design options, which has the

• best performance improvement?
• least cost?
• best ratio of cost-performance?

• Both require
• basis for comparison
• metric for evaluation

• Our goal is to understand what factors in the architecture
contribute to overall system performance and the relative
importance (and cost) of these factors

5

Computer Performance
• Execution time (response time)

• The time between the start and completion of a task (program).

• Throughput
• The total amount of work done in a given time.

• The number of tasks finished in a time period.

• If we upgrade a machine with a faster processor, what
do we increase?

• If we add a new machine to the lab, what do we increase?

6

Computer Performance
• Individual computer user: interested in execution time.

• Care about how long to execute my job?

• Data center manager: interested in throughput
• Care about how many tasks can be done in a time period?

• We will need different performance metrics as well as a
different set of applications to benchmark embedded and
desktop computers, which are more focused on
execution time, versus servers, which are more focused
on throughput.

7

Defining (Speed) Performance
• In discussing the (speed) performance of computers, we

are primarily concerned with execution time. To
maximize performance, need to minimize execution time.

• For some programs running on computer X,

performanceX = 1 / execution_timeX

• Relative performance: If X is n times faster than Y, then

performanceX / performanceY = n

8

Relative Performance Example
• If computer A runs a program in 10 seconds and

computer B runs the same program in 15 seconds, how
much faster is A than B?

• We know that A is n times faster than B if

performanceA
performanceB

=
excution_timeB
excution_timeA

=n

• The performance ratio is 15/10 = 1.5.
• So A is 1.5 times faster than B.

9

Performance Factors
• CPU execution time (CPU time): time the CPU spends

working on a task.
• Does not include time waiting for I/O or running other programs.

CPU execution time
for a program

=
# CPU clock cycles

for a program
×clock cycle time

= #CPU clock cycles for a programclock rate

• We can improve performance by
• reducing the number of clock cycles required for a program;
• reducing the clock cycle time or Increasing clock rate.

10

CPU Clocking
• Operation of digital hardware governed by a constant-

rate clock.

• Clock rate (clock cycles per second in MHz or GHz) is
inverse of clock cycle time (clock period).
• clock rate = 1 / clock cycles time

11

Clock (cycles)

Data transfer
and computation

Update state

Clock period

CPU Clocking
• cycle time (clock period): duration of a clock cycle

• e.g., 250ps = 0.25ns = 250×10–12s

• clock rate (clock frequency): cycles per second
• e.g., 4.0GHz = 4000MHz = 4.0×109Hz

10 nsec clock cycle => 100 MHz clock rate
5 nsec clock cycle => 200 MHz clock rate
2 nsec clock cycle => 500 MHz clock rate
1 nsec (10-9) clock cycle => 1 GHz (109) clock rate
500 psec clock cycle => 2 GHz clock rate
250 psec clock cycle => 4 GHz clock rate
200 psec clock cycle => 5 GHz clock rate

12

Improving Performance Example
• A program runs on computer A with a 2 GHz clock in 10

seconds. We are trying to help a computer designer
build a new machine B, that will run this program in 6
seconds. The designer can use new (or perhaps more
expensive) technology to substantially increase the clock
rate, but has informed us that this increase will affect the
rest of the CPU design, causing machine B to require 1.2
times as many clock cycles as machine A for the same
program. What clock rate should we tell the designer to
target for machine B?

13

Improving Performance Example
CPU timeA=

CPU clock cyclesA
clock rateA

CPU clock cyclesA = CPU timeA × clock rateA

= 10 sec × 2 × 109 cycles/sec
= 20 × 109 cycles

CPU timeB = CPU clock cyclesB / clock rateB

clock rateB = CPU clock cyclesB / CPU timeB
= 1.2 × CPU clock cyclesA / CPU timeB
= 1.2 × 20 × 109 cycles / 6 seconds
= 4 GHz

14

Clock Cycles per Instruction
• Not all instructions take the same amount of time to

execute. Different numbers of cycles for different
instructions.
• Multiplication takes more time than addition.
• Floating point operations take longer than integer ones.
• Accessing memory takes more time than accessing registers.

• One way to think about execution time is that it equals
the number of instructions executed multiplied by the
average time per instruction.

15

CPU clock cycles
for a program

=
# instructions
for a program

×
Average clock cycles

per instrunction

Clock Cycles per Instruction
• Clock cycles per instruction (CPI) – the average number of

clock cycles each instruction takes to execute.
• A way to compare two different implementations of the same ISA.

• Computing the overall effective CPI is done by looking at the
different types of instructions and their individual cycle counts
and averaging.

• A program contains n types of instructions, the CPI of type i is
CPIi, the percentage of the number of instructions of type i is
ICi, the overall effective CPI can be calculated as

overall effective CPI=�CPIi×ICi

n

i=1
• The overall effective CPI varies by instruction mix – a

measure of the dynamic frequency of instructions across one
or many programs.

16

Calculate Performance (example)
A program contains the following instructions:

Instruction Percentage Cycles per Instruction weighted CPI(i)
ALU 50% 1 0.5
Load 20% 5 1.0
Store 10% 3 0.3
Branch 20% 2 0.4

overall effective CPI 2.2

17

Calculate Performance (example 2)
Suppose we have two implementations of the same instruction set
architecture (two machines using the same instruction set architecture).
For some programs,
Machine A has a clock cycle time of 250 ps. and a CPI of 2.0,
Machine B has a clock cycle time of 500 ps. and a CPI of 1.2.
Which machine is faster to run this program, and by how much?

Answer:
Each computer executes the same number of instructions, I, so
Machine A : execution_timeA = I × CPIA × clock cycle timeA
= I × 2.0 × 250 ps = I × 500 ps
Machine B : execution_timeB = I × CPIB × clock cycle timeB
= I × 1.2 × 500 ps = I × 600 ps

Clearly, A is faster … by the ratio of execution times:
performanceA / performanceB = execution_timeB / execution_timeA
= I × 600 ps / I × 500 ps
= 1.2 18

The Performance Equation
• Our basic performance equation is then

CPU execution time
for a program

=
# CPU clock cycles

for a program
×clock cycle time

= instruction count × CPI × clock cycle time
= instruction count × CPI / clock rate
• These equations separate the three key factors that

affect performance
• Can measure the CPU execution time by running the program.
• The clock rate is usually given.
• Can measure overall effective instruction count by using

profilers/simulators without knowing all of the implementation
details.

• CPI varies by instruction type and ISA implementation for which
we must know the implementation details.

19

Determinates of CPU Performance

20

Instruction
count

CPI Clock rate

Programming
language

x x

Compiler x x

ISA x x x

Organization x x

Technology x

Determinates of CPU Performance

• Language : C vs Java
• A same algorithm but written by different languages should have

different programs, assemble programs and different machine
instructions.

• So program language determines instruction count and
CPI.

21

Problem

C program Java program

Assembly
program A

Assembly
program B

Determinates of CPU Performance

• Compiler: Visual studio C++ vs GCC
• A same program but assembled by different compilers should

have different assemble programs and different machine
instructions.

• So compiler determines instruction count and CPI. 22

Algorithm

Compiler A Compiler B

Assembly
program A

Assembly
program B

C program

Determinates of CPU Performance

• ISA: Intel vs Mac
• A program used a same compiler but running at different ISA

machines should have different assemble programs and different
machine instructions.

• Different ISAs have different implementations (CPUs).
• So ISA determines instruction count, CPI and clock rate. 23

Problem

Compiler ISA B

Assembly
Program A

C program

Assembly
Program A

ISA A

Determinates of CPU Performance

24

Problem

Compiler

C program

Assembly
Program

ISA

CPU A

CPU B

CPU: Intel vs AMD
A compiled program running at same ISA but
different CPU machines should have same
assemble programs and same machine
instructions. However, the instructions won’t be
with the same CPI when running at different CPUs.

So CPU determines CPI and clock rate.

Determinates of CPU Performance

25

Problem

Compiler

C program

Assembly
Program

ISA CPU

Material A

Material B

CPU: Intel core i7 (22nm) vs Intel core i7 (14nm)
A compiled program running at same type CPU
machines but manufactured with different technologies
should have same assemble programs and same
machine instructions. And the instructions will be the
same CPI when running at same type CPUs.

So Technology determines clock rate.

Summary
• For a given architecture performance increases come

from:
• use better material to increase clock rate (without increase CPI);
• improvements in processor organization that reduce the CPI;
• compiler enhancements that reduce the CPI and/or instruction

count.

26

Amdahl’s Law
• Improving an aspect of a computer and expecting a

proportional improvement in overall performance

• Example:
“Suppose a program runs in 100 seconds on a machine, with

multiply responsible for 80 seconds of this time. How much do we
have to improve the speed of multiplication if we want the program
to run 4 times faster?“

Exe. Time affected = 80 seconds

Exe. Time unaffected = 100 – 80 = 20 seconds

4 times faster means 100/4=25 seconds, as a result:

25 = 80/Improvement + 20 → Improvement = 16 times
27

unaffected
affected

improved Tfactor timprovemen
T

T +=

Workloads and Benchmarks
• Performance best determined by running a real application

• Use programs typical of expected workload
• Or, typical of expected class of applications
e.g., compilers/editors, scientific applications, graphics, etc.

• Benchmarks – a set of programs that form a “workload”
specifically chosen to measure performance.

• SPEC (System Performance Evaluation Cooperative)
• Companies have agreed on a set of real program and inputs.
• SPEC creates standard sets of benchmarks starting with SPEC89.

The latest is SPEC CPU2006 which consists of 12 integer
benchmarks (CINT2006) and 17 floating-point benchmarks
(CFP2006 www.spec.org).

• There are also benchmark collections for power workloads
(SPECpower_ssj2008), for mail workloads (SPECmail2008), for
multimedia workloads (mediabench), and so on.

28

SPEC Benchmarks

29

Comparing and Summarizing Performance
• How do we summarize the performance for benchmark

set with a single number?
• First the execution times are normalized given the “SPEC ratio”

(bigger is faster, i.e., SPEC ratio is the in.verse of execution time)
• The SPEC ratios are then “averaged” using the geometric mean

(GM).

• Guiding principle in reporting performance
measurements is reproducibility – list everything another
experimenter would need to duplicate the experiment
(version of the operating system, compiler settings, input
set used, specific computer configuration (clock rate,
cache sizes and speed, memory size and speed, etc.))

30

SPEC CINT2006 on AMD Barcelona

31

Uniprocessor Performance

32Constrained by power, instruction-level parallelism, memory latency

CO101�Principle of Computer Organization
Why We Need to Measure Performance?
Two notions of “performance”
Example
Computer Performance Metrics
Computer Performance
Computer Performance
Defining (Speed) Performance
Relative Performance Example
Performance Factors
CPU Clocking
CPU Clocking
Improving Performance Example
Improving Performance Example
Clock Cycles per Instruction
Clock Cycles per Instruction
Calculate Performance (example)
Calculate Performance (example 2)
The Performance Equation
Determinates of CPU Performance
Determinates of CPU Performance
Determinates of CPU Performance
Determinates of CPU Performance
Determinates of CPU Performance
Determinates of CPU Performance
Summary
Amdahl’s Law
Workloads and Benchmarks
SPEC Benchmarks
Comparing and Summarizing Performance
SPEC CINT2006 on AMD Barcelona
Uniprocessor Performance