Lecture 4 – Performance
Stewart Smith Digital Systems Design 4
Digital System Design 4
Lecture 4 – Performance
Dr Stewart Smith
& Dr Chang Liu
Stewart Smith Digital Systems Design 4
Performance Criteria
• Execution Time
‣ Time to do a particular job
• Throughput
‣ Work / Time
• Relative Performance
‣ For a fixed task, which does it faster?
• Cost Effectiveness
‣ Performance / Price
• Power Efficiency
‣ Performance / Power
• Multiprocessor Performance
‣ Future lecture
§1.6 P
erform
ance
Stewart Smith Digital Systems Design 4
Execution Time
• How long does it take a CPU to do a particular (fixed)
job?
• Is CPU doing other work at same time?
‣ Elapsed (‘Wall-clock’) Time
– Includes idle time, I/O, OS overhead
– Easy to measure
‣ CPU Time
– Cycles one processor spends working on job
– Can be quite misleading…
• Problem: Depends on a lot of factors:
‣ Algorithm
‣ Instruction Set Architecture (ISA)
‣ Compiler
‣ Processor
Stewart Smith Digital Systems Design 4
Clocking
• Operation of digital hardware governed by a constant-
rate clock
• Clock period (clock cycle time): duration of a clock cycle
‣ e.g., 250 ps = 0.25 ns = 250×10–12 s
• Clock frequency (clock rate): cycles per second
‣ e.g., 4.0 GHz = 4000 MHz = 4.0×109 Hz
• Frequency = 1 / Period
‣ 1 / (250×10–12 s) = 4.0×109 Hz (4 GHz)
Clock (cycles)
Data transfer
and computation
Update state
Clock Period
Stewart Smith Digital Systems Design 4
CPU Time = CPU Clock Cycles⇥ Clock Cycle Time
=
CPU Clock Cycles
Clock Rate
CPU Time
• Performance improved by
‣ Reducing number of clock cycles
‣ Increasing clock rate
‣ Hardware designer must often trade off clock
rate against cycle count – pipelining
Clock Cycle Time =
1
Clock Rate
Stewart Smith Digital Systems Design 4
Instruction Count & CPI (Cycles
per Instruction)
• Instruction Count for a program
‣ Determined by program, ISA and compiler
• Average cycles per instruction
‣ Determined by CPU hardware
‣ If different instructions have different CPI
– Average CPI affected by instruction mix
Clock Cycles = Instruction Count⇥ Cycles per Instruction
CPU Time = Instruction Count⇥ CPI⇥ Clock Cycle Time
=
Instruction Count⇥ CPI
Clock Rate
Stewart Smith Digital Systems Design 4
CPI in More Detail
• If different instruction classes take different
numbers of cycles
• Weighted average CPI
Relative frequency
Clock Cycles =
nX
i=1
(CPIi ⇥ Instruction Counti)
CPI =
Clock Cycles
Instruction Count
=
nX
i=1
✓
CPIi ⇥
Instruction Counti
Instruction Count
◆
Stewart Smith Digital Systems Design 4
CPI Example
• Alternative compiled code sequences using
instructions in classes A, B, C
Class A B C
CPI for class 1 2 3
IC in sequence 1 2 1 2
IC in sequence 2 4 1 1
• Sequence 1: IC = 5
‣ Clock Cycles
= 2×1 + 1×2 + 2×3
= 10
‣ Avg. CPI = 10/5 = 2.0
• Sequence 2: IC = 6
‣ Clock Cycles
= 4×1 + 1×2 + 1×3
= 9
‣ Avg. CPI = 9/6 = 1.5
Stewart Smith Digital Systems Design 4
Pitfall: Amdahl’s Law
• Amdahl’s Law determines the improvement in performance
given an improvement in one aspect of the computer system
• Suppose a programme takes 100 seconds with 80 seconds
used for multiply operations, how much do we need to speed
this up to run 5 times faster?
Execution time after improvement =
Execution a↵ected by improvement
Amount of improvement
+ Execution time una↵ected by improvement
Execution time after improvement =
80 seconds
n
+ (100� 80) seconds
20 seconds =
80 seconds
n
+ (100� 80) seconds
0 =
80 seconds
n
Stewart Smith Digital Systems Design 4
Pitfall: MIPS as a Performance
Metric
• MIPS: Millions of Instructions Per Second
‣ Doesn’t account for
– Differences in ISAs between computers
– Differences in complexity between instructions
• CPI varies between programs on a given CPU
MIPS =
Instruction Count
Execution Time⇥ 106
=
Instruction Count
Instruction Count⇥CPI
Clock Rate
⇥ 106
=
Clock Rate
CPI⇥ 106
Stewart Smith Digital Systems Design 4
SPEC CPU Benchmark
• Programs used to measure performance
‣ Supposedly typical of actual workload
• Standard Performance Evaluation Corp (SPEC)
‣ Develops benchmarks for CPU, I/O, Web, …
• SPEC CPU2006
‣ Elapsed time to execute a selection of programs
– Negligible I/O, so focuses on CPU performance
‣ Normalize relative to reference machine
‣ Summarise as geometric mean of performance ratios
– CINT2006 (integer) and CFP2006 (floating-point)
n
vuut
nY
i=1
Execution time ratioi
Stewart Smith Digital Systems Design 4
CINT2006 for Intel Core i7 920
Name Description Instruction Count ×109 CPI
Clock Cycle
Time (ns)
Execution
time (s)
Reference
time (s) SPECratio
perl Interpreted string processing 2252 0.60 0.376 508 9770 19.2
bzip2 Block-sorting compression 2390 0.70 0.376 629 9,650 15.4
gcc GNU C Compiler 794 1.20 0.376 358 8,050 22.5
mcf Combinatorial optimization 221 2.66 0.376 221 9,120 41.2
go Go game (AI) 1274 1.10 0.376 527 10,490 19.9
hmmer Search gene sequence 2616 0.60 0.376 590 9,330 15.8
sjeng Chess game (AI) 1948 0.80 0.376 586 12,100 20.7
libquantu
m
Quantum computer
simulation 659 0.44 0.376 109 20,720 190.0
h264avc Video compression 3793 0.50 0.376 713 22,130 31.0
omnetpp Discrete event simulation 367 2.10 0.376 290 6,250 21.5
astar Games/path finding 1250 1.00 0.376 470 7,020 14.9
xalancbmk XML parsing 1045 0.70 0.376 275 6,900 25.1
Geometric mean 25.7
Stewart Smith Digital Systems Design 4
Performance Summary
• Performance depends on
‣ Algorithm: affects IC, possibly CPI
‣ Programming language: affects IC, CPI
‣ Compiler: affects IC, CPI
‣ Instruction set architecture: affects IC, CPI, Tc
CPU Time =
Instructions
Program
⇥
Clock Cycles
Instruction
⇥
Seconds
Clock Cycle
Stewart Smith Digital Systems Design 4
Cost Effectiveness
• Cost / Performance
Stewart Smith Digital Systems Design 4
Relative Performance
• Define Performance = 1/Execution Time
• “X is n times faster than Y”
• Example: time taken to run a program
‣ 10s on A, 15s on B
‣ Execution TimeB / Execution TimeA
= 15s / 10s = 1.5
‣ So A is 1.5 times faster than B
PerformanceX / PerformanceY
= Execution timeY / Execution timeX = n
Stewart Smith Digital Systems Design 4
Power = Capacitive load x Voltage2 x Frequency
Power Trends
• In CMOS IC technology
§1.7 The P
ow
er W
all
x30
2667 3300 3400
12.5 16
2000
200
66
25
3600
75.3
95
87
77
29.1
10.14.94.13.3
103
1
10
100
1000
10,000
80
28
6
(1
98
2)
80
38
6
(1
98
5)
80
48
6
(1
98
9)
P
en
tiu
m
(1
99
3)
P
en
tiu
m
P
ro
(
19
97
)
P
en
tiu
m
4
W
ill
am
et
te
(2
00
1)
P
en
tiu
m
4
P
re
sc
ot
t
(2
00
4)
C
or
e
2
K
en
ts
fie
ld
(2
00
7)
C
lo
ck
R
at
e
(M
H
z)
0
20
40
60
80
100
120
P
ow
er
(
w
at
ts
)
Clock Rate
Power
C
or
e
i5
C
la
rk
da
le
(2
01
0)
C
or
e
i5
Iv
y
B
rid
ge
(2
01
2)
5V ➞ 1V x1000
Stewart Smith Digital Systems Design 4
SPEC Power Benchmark
• Power consumption of server at different
workload levels
‣ Performance: ssj_ops (operations/second)
‣ Power: Watts (J/s)
Overall ssj ops per Watt =
10P
i=0
ssj opsi
10P
i=0
poweri
Stewart Smith Digital Systems Design 4
SPECpower_ssj2008 for Xeon X5650
Target Load % Performance (ssj_ops) Average Power (Wa;s)
100% 865,618 258
90% 786,688 242
80% 698,051 224
70% 607,826 204
60% 521,391 185
50% 436,757 170
40% 345,919 157
30% 262,071 146
20% 176,061 135
10% 86,784 121
0% 0 80
Overall sum 4,787,166 1,922
∑ssj_ops/ ∑power = 2,490
Stewart Smith Digital Systems Design 4
Fallacy: Low Power at Idle
• Look back at Xeon power benchmark
‣ At 100% load: 258 W
‣ At 50% load: 170 W (66%)
‣ At 10% load: 121 W (47%)
• Google data center
‣ Mostly operates at 10% – 50% load
‣ At 100% load less than 1% of the time
• Consider designing processors to make
power proportional to load
Stewart Smith Digital Systems Design 4
Reducing Power
• Suppose a new CPU has
‣ 85% of capacitive load of old CPU
‣ 15% voltage and 15% frequency reduction
• The power wall
‣ We can’t reduce voltage further
‣ We can’t remove more heat
• How else can we improve performance?
Pnew
Pold
=
Cold ⇥ 0.85⇥ (Vold ⇥ 0.85)2 ⇥ Fold ⇥ 0.85
Cold ⇥ V 2old ⇥ Fold
= 0.854 = 0.52
Stewart Smith Digital Systems Design 4
Uniprocessor Performance
§1.8 The S
ea C
hange: The S
w
itch to M
ultiprocessors
Constrained by power, instruction-
level parallelism, memory latency
Pe
rf
or
m
an
ce
v
s.
V
A
X
-1
1/
78
0
Stewart Smith Digital Systems Design 4
Multiprocessors
• Multicore microprocessors
‣ More than one processor per chip
• Requires explicitly parallel programming
‣ Compare with instruction level parallelism
– Hardware executes multiple instructions at once
– Hidden from the programmer
‣ Hard to do
– Programming for performance
– Load balancing
– Optimising communication and synchronisation
Stewart Smith Digital Systems Design 4
CAUTION
• CPU Benchmarks are not very useful for
estimating overall performance
‣ Problems not usually ‘Compute Bound’
‣ Problems usually memory bandwidth bound
Stewart Smith Digital Systems Design 4
CAUTION
• If can’t get data to processor fast enough,
processor will sit idle
• Problem getting worse, not better!