CS计算机代考程序代写 mips compiler ER algorithm Lecture 4 – Performance

Lecture 4 – Performance

Stewart Smith Digital Systems Design 4

Digital System Design 4
Lecture 4 – Performance

Dr Stewart Smith
& Dr Chang Liu

Stewart Smith Digital Systems Design 4

Performance Criteria
• Execution Time
‣ Time to do a particular job
• Throughput
‣ Work / Time
• Relative Performance
‣ For a fixed task, which does it faster?
• Cost Effectiveness
‣ Performance / Price
• Power Efficiency
‣ Performance / Power
• Multiprocessor Performance
‣ Future lecture

§1.6 P
erform

ance

Stewart Smith Digital Systems Design 4

Execution Time
• How long does it take a CPU to do a particular (fixed)

job?
• Is CPU doing other work at same time?
‣ Elapsed (‘Wall-clock’) Time
– Includes idle time, I/O, OS overhead
– Easy to measure

‣ CPU Time
– Cycles one processor spends working on job
– Can be quite misleading…

• Problem: Depends on a lot of factors:
‣ Algorithm
‣ Instruction Set Architecture (ISA)
‣ Compiler
‣ Processor

Stewart Smith Digital Systems Design 4

Clocking
• Operation of digital hardware governed by a constant-

rate clock

• Clock period (clock cycle time): duration of a clock cycle
‣ e.g., 250 ps = 0.25 ns = 250×10–12 s
• Clock frequency (clock rate): cycles per second
‣ e.g., 4.0 GHz = 4000 MHz = 4.0×109 Hz
• Frequency = 1 / Period
‣ 1 / (250×10–12 s) = 4.0×109 Hz (4 GHz)

Clock (cycles)
Data transfer
and computation
Update state

Clock Period

Stewart Smith Digital Systems Design 4

CPU Time = CPU Clock Cycles⇥ Clock Cycle Time

=
CPU Clock Cycles

Clock Rate

CPU Time

• Performance improved by
‣ Reducing number of clock cycles
‣ Increasing clock rate
‣ Hardware designer must often trade off clock

rate against cycle count – pipelining

Clock Cycle Time =
1

Clock Rate

Stewart Smith Digital Systems Design 4

Instruction Count & CPI (Cycles
per Instruction)

• Instruction Count for a program
‣ Determined by program, ISA and compiler
• Average cycles per instruction
‣ Determined by CPU hardware
‣ If different instructions have different CPI
– Average CPI affected by instruction mix

Clock Cycles = Instruction Count⇥ Cycles per Instruction
CPU Time = Instruction Count⇥ CPI⇥ Clock Cycle Time

=
Instruction Count⇥ CPI

Clock Rate

Stewart Smith Digital Systems Design 4

CPI in More Detail
• If different instruction classes take different

numbers of cycles

• Weighted average CPI

Relative frequency

Clock Cycles =
nX

i=1

(CPIi ⇥ Instruction Counti)

CPI =
Clock Cycles

Instruction Count
=

nX

i=1


CPIi ⇥

Instruction Counti
Instruction Count

Stewart Smith Digital Systems Design 4

CPI Example
• Alternative compiled code sequences using

instructions in classes A, B, C

Class A B C
CPI for class 1 2 3
IC in sequence 1 2 1 2
IC in sequence 2 4 1 1

• Sequence 1: IC = 5
‣ Clock Cycles

= 2×1 + 1×2 + 2×3
= 10
‣ Avg. CPI = 10/5 = 2.0

• Sequence 2: IC = 6
‣ Clock Cycles

= 4×1 + 1×2 + 1×3
= 9
‣ Avg. CPI = 9/6 = 1.5

Stewart Smith Digital Systems Design 4

Pitfall: Amdahl’s Law
• Amdahl’s Law determines the improvement in performance

given an improvement in one aspect of the computer system

• Suppose a programme takes 100 seconds with 80 seconds
used for multiply operations, how much do we need to speed
this up to run 5 times faster?

Execution time after improvement =

Execution a↵ected by improvement

Amount of improvement
+ Execution time una↵ected by improvement

Execution time after improvement =
80 seconds

n
+ (100� 80) seconds

20 seconds =
80 seconds

n
+ (100� 80) seconds

0 =
80 seconds

n

Stewart Smith Digital Systems Design 4

Pitfall: MIPS as a Performance
Metric

• MIPS: Millions of Instructions Per Second
‣ Doesn’t account for
– Differences in ISAs between computers
– Differences in complexity between instructions

• CPI varies between programs on a given CPU

MIPS =
Instruction Count

Execution Time⇥ 106

=
Instruction Count

Instruction Count⇥CPI
Clock Rate

⇥ 106
=

Clock Rate

CPI⇥ 106

Stewart Smith Digital Systems Design 4

SPEC CPU Benchmark
• Programs used to measure performance
‣ Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
‣ Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
‣ Elapsed time to execute a selection of programs
– Negligible I/O, so focuses on CPU performance

‣ Normalize relative to reference machine
‣ Summarise as geometric mean of performance ratios
– CINT2006 (integer) and CFP2006 (floating-point)

n

vuut
nY

i=1

Execution time ratioi

Stewart Smith Digital Systems Design 4

CINT2006 for Intel Core i7 920
Name Description Instruction Count ×109 CPI

Clock Cycle
Time (ns)

Execution
time (s)

Reference
time (s) SPECratio

perl Interpreted string processing 2252 0.60 0.376 508 9770 19.2

bzip2 Block-sorting compression 2390 0.70 0.376 629 9,650 15.4

gcc GNU C Compiler 794 1.20 0.376 358 8,050 22.5

mcf Combinatorial optimization 221 2.66 0.376 221 9,120 41.2

go Go game (AI) 1274 1.10 0.376 527 10,490 19.9
hmmer Search gene sequence 2616 0.60 0.376 590 9,330 15.8
sjeng Chess game (AI) 1948 0.80 0.376 586 12,100 20.7
libquantu
m

Quantum computer
simulation 659 0.44 0.376 109 20,720 190.0

h264avc Video compression 3793 0.50 0.376 713 22,130 31.0

omnetpp Discrete event simulation 367 2.10 0.376 290 6,250 21.5

astar Games/path finding 1250 1.00 0.376 470 7,020 14.9
xalancbmk XML parsing 1045 0.70 0.376 275 6,900 25.1
Geometric mean 25.7

Stewart Smith Digital Systems Design 4

Performance Summary

• Performance depends on
‣ Algorithm: affects IC, possibly CPI
‣ Programming language: affects IC, CPI
‣ Compiler: affects IC, CPI
‣ Instruction set architecture: affects IC, CPI, Tc

CPU Time =
Instructions

Program

Clock Cycles

Instruction

Seconds

Clock Cycle

Stewart Smith Digital Systems Design 4

Cost Effectiveness
• Cost / Performance

Stewart Smith Digital Systems Design 4

Relative Performance
• Define Performance = 1/Execution Time
• “X is n times faster than Y”

• Example: time taken to run a program
‣ 10s on A, 15s on B
‣ Execution TimeB / Execution TimeA

= 15s / 10s = 1.5

‣ So A is 1.5 times faster than B

PerformanceX / PerformanceY
= Execution timeY / Execution timeX = n

Stewart Smith Digital Systems Design 4

Power = Capacitive load x Voltage2 x Frequency

Power Trends

• In CMOS IC technology

§1.7 The P
ow

er W
all

x30

2667 3300 3400

12.5 16

2000

200
66

25

3600

75.3
95

87
77

29.1
10.14.94.13.3

103

1

10

100

1000

10,000

80
28

6
(1

98
2)

80
38

6
(1

98
5)

80
48

6
(1

98
9)

P
en

tiu
m

(1
99

3)

P
en

tiu
m

P
ro

(
19

97
)

P
en

tiu
m

4
W

ill
am

et
te

(2
00

1)
P

en
tiu

m
4

P
re

sc
ot

t
(2

00
4)

C
or

e
2

K
en

ts
fie

ld
(2

00
7)

C
lo

ck
R

at
e

(M
H

z)

0

20

40

60

80

100

120

P
ow

er
(

w
at

ts
)

Clock Rate

Power

C
or

e
i5

C
la

rk
da

le

(2
01

0)
C

or
e

i5
Iv

y
B

rid
ge

(2
01

2)

5V ➞ 1V x1000

Stewart Smith Digital Systems Design 4

SPEC Power Benchmark

• Power consumption of server at different
workload levels
‣ Performance: ssj_ops (operations/second)
‣ Power: Watts (J/s)

Overall ssj ops per Watt =

10P
i=0

ssj opsi

10P
i=0

poweri

Stewart Smith Digital Systems Design 4

SPECpower_ssj2008 for Xeon X5650
Target Load % Performance (ssj_ops) Average Power (Wa;s)
100% 865,618 258
90% 786,688 242
80% 698,051 224
70% 607,826 204
60% 521,391 185
50% 436,757 170
40% 345,919 157
30% 262,071 146
20% 176,061 135
10% 86,784 121
0% 0 80
Overall sum 4,787,166 1,922
∑ssj_ops/ ∑power = 2,490

Stewart Smith Digital Systems Design 4

Fallacy: Low Power at Idle
• Look back at Xeon power benchmark
‣ At 100% load: 258 W
‣ At 50% load: 170 W (66%)
‣ At 10% load: 121 W (47%)

• Google data center
‣ Mostly operates at 10% – 50% load
‣ At 100% load less than 1% of the time

• Consider designing processors to make
power proportional to load

Stewart Smith Digital Systems Design 4

Reducing Power
• Suppose a new CPU has
‣ 85% of capacitive load of old CPU
‣ 15% voltage and 15% frequency reduction

• The power wall
‣ We can’t reduce voltage further
‣ We can’t remove more heat
• How else can we improve performance?

Pnew
Pold

=
Cold ⇥ 0.85⇥ (Vold ⇥ 0.85)2 ⇥ Fold ⇥ 0.85

Cold ⇥ V 2old ⇥ Fold
= 0.854 = 0.52

Stewart Smith Digital Systems Design 4

Uniprocessor Performance
§1.8 The S

ea C
hange: The S

w
itch to M

ultiprocessors

Constrained by power, instruction-
level parallelism, memory latency

Pe
rf
or

m
an

ce
v

s.
V

A
X

-1
1/

78
0

Stewart Smith Digital Systems Design 4

Multiprocessors
• Multicore microprocessors
‣ More than one processor per chip

• Requires explicitly parallel programming
‣ Compare with instruction level parallelism
– Hardware executes multiple instructions at once
– Hidden from the programmer

‣ Hard to do
– Programming for performance
– Load balancing
– Optimising communication and synchronisation

Stewart Smith Digital Systems Design 4

CAUTION
• CPU Benchmarks are not very useful for

estimating overall performance
‣ Problems not usually ‘Compute Bound’
‣ Problems usually memory bandwidth bound

Stewart Smith Digital Systems Design 4

CAUTION

• If can’t get data to processor fast enough,
processor will sit idle

• Problem getting worse, not better!