CS计算机代考程序代写 mips compiler algorithm Digital System Design 4 Lecture 4 – Performance

Digital System Design 4 Lecture 4 – Performance
Dr Stewart Smith & Dr Chang Liu
Stewart Smith Digital Systems Design 4

Performance Criteria
• •
• •
• •

Execution Time
‣ Time to do a particular job
Throughput
‣ Work / Time
Relative Performance
‣ For a fixed task, which does it faster?
Cost Effectiveness
‣ Performance / Price
Power Efficiency
‣ Performance / Power
Multiprocessor Performance
Future lecture
Stewart Smith
Digital Systems Design 4
§1.6 Performance

Execution Time
• •

‣ CPU Time
– Cycles one processor spends working on job – Can be quite misleading…

‣ ‣ ‣ ‣
How long does it take a CPU to do a particular (fixed)
job?
Is CPU doing other work at same time?
Elapsed (‘Wall-clock’) Time
– Includes idle time, I/O, OS overhead – Easy to measure
Problem: Depends on a lot of factors:
Algorithm
Instruction Set Architecture (ISA) Compiler
Processor
Stewart Smith
Digital Systems Design 4

Clocking
Operation of digital hardware governed by a constant- rate clock

Clock (cycles)
Data transfer and computation
Update state
• •


Clock Period
Clock period (clock cycle time): duration of a clock cycle ‣ e.g., 250 ps = 0.25 ns = 250×10–12 s
Clock frequency (clock rate): cycles per second ‣ e.g., 4.0 GHz = 4000 MHz = 4.0×109 Hz
Frequency = 1 / Period
1 / (250×10–12 s) = 4.0×109 Hz (4 GHz)
Stewart Smith
Digital Systems Design 4

CPU Time
Clock Cycle Time =
1
Clock Rate
CPU Time = CPU Clock Cycles ⇥ Clock Cycle Time
CPU Clock Cycles
Clock Rate
=




Performance improved by
Reducing number of clock cycles Increasing clock rate
Hardware designer must often trade off clock rate against cycle count – pipelining
Stewart Smith
Digital Systems Design 4

Instruction Count & CPI (Cycles per Instruction)
Clock Cycles = Instruction Count ⇥ Cycles per Instruction CPU Time = Instruction Count ⇥ CPI ⇥ Clock Cycle Time
Instruction Count ⇥ CPI Clock Rate
=
• •
‣ ‣
Stewart Smith
Instruction Count for a program
‣ Determined by program, ISA and compiler
Average cycles per instruction
Determined by CPU hardware
If different instructions have different CPI
– Average CPI affected by instruction mix Digital Systems Design 4

CPI in More Detail
• •
If different instruction classes take different numbers of cycles
Clock Cycles =
Xn i=1
(CPIi ⇥ Instruction Counti)
Weighted average CPI
CPI = Clock Cycles = Xn ✓CPIi ⇥ Instruction Counti ◆ Instruction Count i=1 Instruction Count
Relative frequency
Stewart Smith
Digital Systems Design 4

CPI Example

Alternative compiled code sequences using instructions in classes A, B, C
Class
A
B
C
CPI for class
1
2
3
IC in sequence 1
2
1
2
IC in sequence 2
4
1
1

Sequence 1: IC = 5 ‣ Clock Cycles
= 2×1 + 1×2 + 2×3
= 10
‣Avg. CPI = 10/5 = 2.0

Sequence 2: IC = 6 ‣ Clock Cycles
= 4×1 + 1×2 + 1×3
= 9
‣Avg. CPI = 9/6 = 1.5
Stewart Smith Digital Systems Design 4

Pitfall: Amdahl’s Law Amdahl’s Law determines the improvement in performance
given an improvement in one aspect of the computer system


Execution time after improvement =
Execution a↵ected by improvement + Execution time una↵ected by improvement Amount of improvement
Suppose a programme takes 100 seconds with 80 seconds used for multiply operations, how much do we need to speed this up to run 5 times faster?
Execution time after improvement = 80 seconds + (100 80) seconds n
20 seconds = 80 seconds + (100 80) seconds n
80 seconds n
0=
Stewart Smith
Digital Systems Design 4

Pitfall: MIPS as a Performance Metric


– –
MIPS: Millions of Instructions Per Second
Doesn’t account for
Differences in ISAs between computers Differences in complexity between instructions
MIPS = =
Instruction Count
Execution Time ⇥ 106 Instruction Count
Instruction Count⇥CPI ⇥ 106 Clock Rate
= Clock Rate CPI ⇥ 106

CPI varies between programs on a given CPU
Stewart Smith
Digital Systems Design 4

SPEC CPU Benchmark
• •


‣ ‣
Programs used to measure performance ‣ Supposedly typical of actual workload
Standard Performance Evaluation Corp (SPEC) ‣ Develops benchmarks for CPU, I/O, Web, …
SPEC CPU2006
Elapsed time to execute a selection of programs – Negligible I/O, so focuses on CPU performance
Normalize relative to reference machine
Summarise as geometric mean of performance ratios – CINT2006 (integer) and CFP2006 (floating-point)
vutYn
n Execution time ratioi
i=1
Digital Systems Design 4
Stewart Smith

CINT2006 for Intel Core i7 920
Name
Description
Instruction Count ×109
CPI
Clock Cycle Time (ns)
Execution time (s)
Reference time (s)
SPECratio
perl
Interpreted string processing
2252
0.60
0.376
508
9770
19.2
bzip2
Block-sorting compression
2390
0.70
0.376
629
9,650
15.4
gcc
GNU C Compiler
794
1.20
0.376
358
8,050
22.5
mcf
Combinatorial optimization
221
2.66
0.376
221
9,120
41.2
go
Go game (AI)
1274
1.10
0.376
527
10,490
19.9
hmmer
Search gene sequence
2616
0.60
0.376
590
9,330
15.8
sjeng
Chess game (AI)
1948
0.80
0.376
586
12,100
20.7
libquantu m
Quantum computer simulation
659
0.44
0.376
109
20,720
190.0
h264avc
Video compression
3793
0.50
0.376
713
22,130
31.0
omnetpp
Discrete event simulation
367
2.10
0.376
290
6,250
21.5
astar
Games/path finding
1250
1.00
0.376
470
7,020
14.9
xalancbmk
XML parsing
1045
0.70
0.376
275
6,900
25.1
Geometric mean
25.7
Stewart Smith Digital Systems Design 4

Performance Summary
CPU Time = Instructions ⇥ Clock Cycles ⇥ Seconds Program Instruction Clock Cycle



Performance depends on
Algorithm: affects IC, possibly CPI Programming language: affects IC, CPI Compiler: affects IC, CPI

‣ Instruction set architecture: affects IC, CPI, Tc
Stewart Smith
Digital Systems Design 4

Cost Effectiveness

Cost / Performance
Stewart Smith Digital Systems Design 4

Relative Performance
• •

‣ ‣

Define Performance = 1/Execution Time “X is n times faster than Y”
PerformanceX / PerformanceY
= Execution timeY / Execution timeX = n
Example: time taken to run a program
10s on A, 15s on B
Execution TimeB / Execution TimeA = 15s / 10s = 1.5
So A is 1.5 times faster than B
Stewart Smith
Digital Systems Design 4

Power Trends
10,000 120 100
1000
Clock Rate 200 66
103 75.3
Power
29.1
95
87
77
2000
3600 2667 3300 3400
12.5 3.3
16 4.1
25 4.9
10.1
80 100 60 40 20
10 10
• In CMOS IC technology
Power = Capacitive load x Voltage2 x Frequency
Stewart Smith
x30 5V ➞ 1V x1000 Digital Systems Design 4
§1.7 The Power Wall
80286 (1982)
80386 (1985)
80486 (1989)
Pentium (1993)
Pentium Pro (1997)
Pentium 4 Willamette (2001)
Pentium 4 Prescott (2004)
Core 2 Kentsfield (2007)
Core i5 Clarkdale
(2010)
Core i5 Ivy Bridge (2012)
Clock Rate (MHz)
Power (watts)

SPEC Power Benchmark

‣ ‣
Power consumption of server at different workload levels
Performance: ssj_ops (operations/second) Power: Watts (J/s)
P10
ssj opsi
i=0 Overall ssj ops per Watt = P
poweri i=0
10
Stewart Smith
Digital Systems Design 4

SPECpower_ssj2008 for Xeon X5650
Target Load %
Performance (ssj_ops)
Average Power (Wa;s)
100%
865,618
258
90%
786,688
242
80%
698,051
224
70%
607,826
204
60%
521,391
185
50%
436,757
170
40%
345,919
157
30%
262,071
146
20%
176,061
135
10%
86,784
121
0%
0
80
Overall sum
4,787,166
1,922
∑ssj_ops/ ∑power =
2,490
Stewart Smith Digital Systems Design 4

Fallacy: Low Power at Idle

‣ ‣ ‣

‣ ‣

Look back at Xeon power benchmark
At 100% load: 258 W
At 50% load: 170 W (66%) At 10% load: 121 W (47%)
Google data center
Mostly operates at 10% – 50% load
At 100% load less than 1% of the time
Consider designing processors to make power proportional to load
Stewart Smith
Digital Systems Design 4

Reducing Power
•Suppose a new CPU has
85% of capacitive load of old CPU
15% voltage and 15% frequency reduction



‣ ‣
Pnew = Cold ⇥0.85⇥(Vold ⇥0.85)2 ⇥Fold ⇥0.85 = 0.854 = 0.52
Pold Cold ⇥V2 ⇥Fold old
The power wall
We can’t reduce voltage further ‣ We can’t remove more heat
How else can we improve performance?
Stewart Smith
Digital Systems Design 4

Uniprocessor Performance
Constrained by power, instruction-
level parallelism, memory latency Digital Systems Design 4
Stewart Smith
§1.8 The Sea Change: The Switch to Multiprocessors
Performance vs. VAX-11/780

Multiprocessors



‣ ‣
Multicore microprocessors
More than one processor per chip Requires explicitly parallel programming
Compare with instruction level parallelism
– Hardware executes multiple instructions at once – Hidden from the programmer
Hard to do
– Programming for performance
– Load balancing
– Optimising communication and synchronisation
Stewart Smith
Digital Systems Design 4

CAUTION

‣ ‣
CPU Benchmarks are not very useful for estimating overall performance
Problems not usually ‘Compute Bound’ Problems usually memory bandwidth bound
Stewart Smith
Digital Systems Design 4

CAUTION
• •
If can’t get data to processor fast enough, processor will sit idle
Problem getting worse, not better!
Stewart Smith
Digital Systems Design 4