代写 C algorithm math MIPS parallel compiler database graph software network GPU theory slides are adapted from CA course of wisc, princeton, mit, berkeley, etc.

slides are adapted from CA course of wisc, princeton, mit, berkeley, etc.
The uses of the slides of this course are for educaonal purposes only and should be
used only in conjuncon with the textbook. Derivaves of the slides must
acknowledge the copyright noces of this and the originals.
1

Which of the following airplanes has the best performance?
Airplane Passengers
Range mi
630 4150 4000 8720
Speed mph
598
610 1350 544
Boeing 737100 Boeing 747 BACSud Concorde Douglas DC850
101 470 132 146
How much faster is the Concorde vs. the 747
How much bigger is the 747 vs. DC8?
1

Which computer is fastest?
2

Which computer is fastest?
The answer is not simple:
Scientific simulation FP performance
Program development Integer performance Database workload Memory, IO
3

Want to buy the fastest computer for what you want to do?
Workload is very important
Correct measurement and analysis method
4

Want to buy the fastest computer for what you want to do?
Workload is very important
Correct measurement and analysis method
Want to design the fastest computer for what the customer wants to pay?
Cost is an important criterion
5

Time and performance
Iron Law Time
MIPS and MFLOPS
Which programs and how to average Amdahls law
6

?
What is important to whom?
7

?
What is important to whom?
Computer system user
Minimize elapsed time for program:
tresp tend tstart
Called response time
8

?
What is important to whom?
Computer system user
Minimize elapsed time for program:
tresp tend tstart
Called response time
Computer center manager
Maximize completion rate jobssecond Called throughput
9

Response Time vs. Throughput
Is throughput 1avg. response time?
10

Response Time vs. Throughput
Is throughput 1avg. response time?
Only if NO overlap
Otherwise, throughput 1avg. response time
11

Response Time vs. Throughput
Is throughput 1avg. response time?
Only if NO overlap
Otherwise, throughput 1avg. response time
:A lunch buffet Assume 5 entrees
Each person takes 2 minutesentree
Throughput is 1 person every 2 minutes
BUT time to fill up tray is 10 minutes
Why and what would the throughput be otherwise?
5 people simultaneously filling tray overlap Without overlap, throughput 110
12

?
For computer architects
CPU time time spent running a program
Intuitively, bigger should be faster, so:
Performance 1X time, where X is response, CPU
execution, etc.
Elapsed time CPU time IO wait We will concentrate on CPU time
13

?
Improve a response time or b throughput? Faster CPU
Add more CPUs
14

?
Improve a response time or b throughput? Faster CPU
Helps both a and b Add more CPUs
15

?
Improve a response time or b throughput? Faster CPU
Helps both a and b Add more CPUs
Helps b and perhaps a due to less queueing
16

?
MachineAisntimesfasterthanmachineBiff: perfAperfB timeBtimeA n
Machine A is x faster than machine B iff perfAperfB timeBtimeA 1 x100
17

?
MachineAisntimesfasterthanmachineBiff perfAperfB timeBtimeA n
Machine A is x faster than machine B iff perfAperfB timeBtimeA 1 x100
:timeA 10s, timeB 15s
1510 1.5 A is 1.5 times faster than B 1510 1.5 A is 50 faster than B
18

?
MachineAisntimesfasterthanmachineBiff perfAperfB timeBtimeA n
Machine A is x faster than machine B iff perfAperfB timeBtimeA 1 x100
:timeA 10s, timeB 15s
1510 1.5 A is 1.5 times faster than B 1510 1.5 A is 50 faster than B
How to obtain the time?
19

:
Breakingprogramintoinstructions
HW is aware of instructions, not programs
20

:
Breakingprogramintoinstructions
HW is aware of instructions, not programs
Atlowerlevel,HWbreaksinstructionsintocycles Lower level state machines change state every cycle
21

:
Breakingprogramintoinstructions
HW is aware of instructions, not programs
Atlowerlevel,HWbreaksinstructionsintocycles Lower level state machines change state every cycle
For example:
1GHz Snapdragon runs 1000M cyclessec, 1 cycle 1ns 2.5GHz Core i7 runs 2.5G cyclessec, 1 cycle 0.25ns
22

Iron Law in Performance
Processor Performance
Time Program

Instructions Program
code size
X
Cycles Instruction
CPI
X
Time Cycle
cycle time
Architecture Implementation Realization Compiler Designer Processor Designer Chip Designer
23

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code size Determined by algorithm, compiler, ISA
24

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code size Determined by algorithm, compiler, ISA
25

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code size Determined by algorithm, compiler, ISA
CyclesInstruction
Determined by ISA and CPU organization
Overlap among instructions reduces this term
26

Iron Law in Performance
InstructionsProgram
Instructions executed, not static code size Determined by algorithm, compiler, ISA
CyclesInstruction
Determined by ISA and CPU organization
Overlap among instructions reduces this term
Timecycle
Determined by technology, organization, clever circuit design
27

Iron Law
InstructionsProgram
Instructions executed, not static code size Determined by algorithm, compiler, ISA
CyclesInstruction
Determined by ISA and CPU organization
Overlap among instructions reduces this term
Timecycle
Determined by technology, organization, clever circuit design
28

Minimize time which is the product, NOT isolated terms
Common error to miss terms while devising optimizations
E.g. ISA change to decrease instruction count
BUT leads to CPU organization which makes clock slower
29

Minimize time which is the product, NOT isolated terms
Common error to miss terms while devising optimizations
E.g. ISA change to decrease instruction count
BUT leads to CPU organization which makes clock
slower
Be mind that: terms are interrelated
30

MIPS:
MIPS instruction countexecution time x 106
clock rateCPI x 106
But MIPS has serious shortcomings
31

MIPS
E.g. without FP hardware, an FP op may take 50
singlecycle instructions
With FP hardware, only one 2cycle instruction
Thus, adding FP hardware:
CPI increases why?
Instructionsprogram decreases why?
Total execution time decreases
BUT, MIPS gets worse!
5050 21 50 1
50 2
50 MIPS 2 MIPS
32

MIPS Ignores program
Usually used to quote peak performance
Ideal conditions guaranteed not to exceed!
33

MIPS Ignores program
Usually used to quote peak performance
Ideal conditions guaranteed not to exceed!
When is MIPS ok?
Same compiler, same ISA
E.g. same binary running on AMD Jaguar, Intel Core i7
Why?
34

MIPS Ignores program
Usually used to quote peak performance
Ideal conditions guaranteed not to exceed!
When is MIPS ok?
Same compiler, same ISA
E.g. same binary running on AMD Jaguar, Intel Core i7
Why?
Instrprogram is constant and can be factored out
35

MFLOPS FP ops in programexecution time x 106
Assuming FP ops independent of compiler and
ISA
Often safe for numeric codes: matrix size determines of FP opsprogram
However, not always safe:
Missing instructions e.g. FP divide Optimizing compilers
36

Use ONLY Time
Beware when reading, especially if details are
omitted
Be careful when you buy products
Beware of Peak
Guaranteed not to exceed
37

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramx MachineB:clock2ns,CPI1.2,forprogramx Which machine is faster and how much?
38

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramx MachineB:clock2ns,CPI1.2,forprogramx Which machine is faster and how much?
39

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramx
MachineB:clock2ns,CPI1.2,forprogramx
Which machine is faster and how much?
Ans:
TimeProgram instrprogram x cyclesinstr x seccycle TimeA N x 2.0 x 1 2N
TimeB N x 1.2 x 2 2.4N
Compare: TimeBTimeA 2.4N2N 1.2
40

Iron Law Example A
MachineA:clock1ns,CPI2.0,forprogramx
MachineB:clock2ns,CPI1.2,forprogramx
Whichisfasterandhowmuch? Ans:
TimeProgram instrprogram x cyclesinstr x seccycle TimeA N x 2.0 x 1 2N
TimeB N x 1.2 x 2 2.4N
Compare: TimeBTimeA 2.4N2N 1.2
So, Machine A is 20 faster than Machine B for program x
41

Iron Law Example B Keep clockA 1ns and clockB 2ns
To get equal performance, if CPIB1.2, what should CPIA be?
42

Iron Law Example
Keep clockA 1ns and clockB 2ns
To get equal performance, if CPIB1.2, what should CPIA be?
Ans: TimeB TimeA 1
Nx2x1.2 Nx1xCPIA CPIA 2.4
43

Iron Law Example C Keep CPIA2.0 and CPIB1.2
For equal performance, if clockB2ns, what should clockA be?
Ans: TimeBTimeA 1
N x 2.0 x clockAN x 1.2 x 2 clockA 1.2ns
44

Time and performance: Machine A n times faster than Machine B
Iff TimeBTimeA n
Iron Law: Performance Timeprogram

Instructions Program
code size
X
Cycles Instruction
CPI
X
Time Cycle
cycle time
Other Metrics: MIPS and MFLOPS Beware of peak and omitted details
45

?
Executiontimeofwhatprogram?
Best case you always run the same set of programs Port them and time the whole workload
46

?
Executiontimeofwhatprogram?
Best case you always run the same set of programs
Port them and time the whole workload
In reality, use benchmarks
Programs chosen to measure performance
Predict performance of actual workload Saves effort and money
Representative? Honest? Benchmarketing…
47

Machine A
Machine B
Program 1
1
10
Program 2
1000
100
Total
1001
110
One answer: for total execution time, how much faster is B?
1001 110 9.1x
48

?
Machine A
Machine B
Program 1
1
10
Program 2
1000
100
Total
1001
110
One answer: for total execution time, how much faster is B?
1001 110 9.1x
49

How to Average?
Another:arithmeticmeansameresult
Arithmeticmeanoftimes:
AMA 10012 500.5
AMB110255
Speedup:500.5559.1x
n 1 timei n
i1
50

How to Average?
Another:arithmeticmeansameresult
Arithmeticmeanoftimes:
AMA 10012 500.5
AMB110255
Speedup:500.5559.1x
Valid only if programs run equally often, so use weighted arithmetic mean:
n 1 timei n
i1
n 1
weightitimei
n
i1
51

How to Average?
Another:arithmeticmeansameresult
Arithmeticmeanoftimes:
AMA 10012 500.5
AMB110255
Speedup:500.5559.1x
Valid only if programs run equally often, so use weighted arithmetic mean:
n 1 timei n
i1
n 1
weightitimei
n
i1
How to determine the weight?
52

Problem of AM
E.g., 30 mph for first 10 miles, then 90 mph for next 10 miles, what is average speed?
Average speed 30902 ???
53

Problem of AM
E.g., 30 mph for first 10 miles, then 90 mph for next 10 miles, what is average speed?
Average speed 30902
WRONG!
54

Problem of AM
E.g., 30 mph for first 10 miles, then 90 mph for next 10 miles, what is average speed?
Average speed 30902
Average speed total distance total time
20 1030 1090 45 mph
55

Harmonic Mean
Harmonic mean of rates
Use HM if forced to start and end with rates e.g.
reporting MIPS or MFLOPS
Why?
Rate has time in denominator
Mean should be proportional to inverse of sums of time not sum of inverses
See: J.E. Smith, Characterizing computer performance with a single number, CACM Volume 31 , Issue 10 October 1988, pp. 12021206.
n
n1
raten i1
56

Dealing with Ratios
Machine A
Machine B
Program 1
1
10
Program 2
1000
100
Total
1001
110
If we take ratios with respect to machine A
Machine A
Machine B
Program 1
1
10
Program 2
1
0.1
Average
1
5.05
57

Dealing with Ratios
Avg. wrt. machine A: A is 1, 5.05
If we take ratios with respect to machine B
Machine A
Machine B
Program 1
0.1
1
Program 2
10
1
Average
5.05
1
58

Dealing with Ratios
Avg. wrt. machine A: A is 1, 5.05
If we take ratios with respect to machine B
Machine A
Machine B
Program 1
0.1
1
Program 2
10
1
Average
5.05
1
Cant both be true!!!
Dont use arithmetic mean on ratios!
59

Geometric Mean
Use geometric mean for ratios Geometric mean of ratios
Independent of reference machine
In the example, GM for machine a is 1, for
machine B is also 1
Normalized with respect to either machine
n
n ratioi
i1
60

But…
GMofratiosisnotproportionaltototaltime
AMinexamplesaysmachineBis9.1timesfaster
GM says they are equal
If we took total execution time, A and B are equal only if
Program 1 is run 100 times more often than program 2
Generally, GM will mispredict for three or more machines
61

Use AM for times
Use HM if forced to use rates Use GM if forced to use ratios
Best of all, use unnormalized numbers to compute time
62

: SPEC2000
System Performance Evaluation Cooperative Formed in 80s to combat benchmarketing
SPEC89, SPEC92, SPEC95, SPEC2000, SPEC2006 Latest one is SPEC2017
12 integer and 14 floatingpoint programs
Sun Ultra5 300MHz reference machine has score
of 100
Report GM of ratios to reference machine
63

Benchmarks: SPEC CINT2000
Benchmark
Description
164.gzip
Compression
175.vpr
FPGA place and route
176.gcc
C compiler
181.mcf
Combinatorial optimization
186.crafty
Chess
197.parser
Word processing, grammatical analysis
252.eon
Visualization ray tracing
253.perlbmk
PERL script execution
254.gap
Group theory interpreter
255.vortex
Objectoriented database
256.bzip2
Compression
300.twolf
Place and route simulator
64

Benchmarks: SPEC CFP2000
Benchmark
Description
168.wupwise
PhysicsQuantum Chromodynamics
171.swim
Shallow water modeling
172.mgrid
Multigrid solver: 3D potential field
173.applu
Parabolicelliptic PDE
177.mesa
3D graphics library
178.galgel
Computational Fluid Dynamics
179.art
Image RecognitionNeural Networks
183.equake
Seismic Wave Propagation Simulation
187.facerec
Image processing: face recognition
188.ammp
Computational chemistry
189.lucas
Number theoryprimality testing
191.fma3d
Finiteelement Crash Simulation
200.sixtrack
High energy nuclear physics accelerator design
301.apsi
Meteorology: Pollutant distribution
65

Benchmark Pitfalls
Benchmark not representative
Your workload is IO bound, SPEC is useless
66

Benchmark Pitfalls
Benchmark not representative
Your workload is IO bound, SPEC is useless
Benchmark is too old
Benchmarks age poorly; benchmarketing pressure causes vendors to optimize compiler, hardware, software to match benchmarks
Need to be periodically refreshed
67

Amdahls Law
Motivationforoptimizingcommoncase
Speedup old time new time new rate old rate
Let an optimization speed fraction f of time by a
Newtime 1f x oldtime fs x oldtime
Speedup oldtime newtime
Speedup oldtime 1f x oldtime fs x oldtime
factor of s
Math: If f is
Speedup
1 f foldtime
1 f o l d t i m e sf o l d t i m e
1 1f
f s
68

Amdahls Law Example
Your boss asks you to improve performance by: Improve the ALU used 95 of time by 10
Improve memory pipeline used 5 of time by 10x
f
95
5
5
s
1.10
10

Speedup
Speedup 1 1f
f s
1.094
1.047
1.052
69

Amdahls Law: Limit
Make common case fast:
10
9
8
7
6
5
4
3
2
1
0
lim 1 1 s 1 f sf 1 f
0 0.2 0.4 0.6 0.8 1 f
70
Speedup

Amdahls Law: Limit
Consideruncommoncase!
If 1f is nontrivial
Speedup is limited!
Particularly true for exploiting parallelism in the
large, where large s is not cheap
GPU with e.g. 1024 processors shader cores
Parallel portion speeds up by s 1024x
Serial portion of code 1f limits speedup
E.g. 10 serial portion: 10.1 10x speedup with 1000 cores
lim 1 1 s 1 f sf 1 f
71

Benchmarks: SPEC2000
Summarize performance:
AM for time HM for rate GM for ratio
Amdahls Law:
Speedup 1 1 f
f s
72