CS计算机代考程序代写 mips compiler ECE437: Introduction to Digital Computer Design

ECE437: Introduction to Digital Computer Design
Chapter 1b (performance, 1.4 onwards)
Spring 2021
Performance of Computers
• Which computer is fastest? • Not so simple
– scientific simulation – FP performance
– program development – Integer performance
– commercial work – memory + I/O
ECE437, S’21 © Vijaykumar and Thottethodi (2)
Performance of Computers
• Want to buy the fastest computer for what you want to do
– workload is important
• Want to design the fastest computer for what they want to pay
– cost is an important criterion
ECE437, S’21 © Vijaykumar and Thottethodi (3)
Outline
• Time and performance
• Iron law
• MIPS and MFLOPS
• Which programs and how to average • Amdahl’s law
ECE437, S’21 © Vijaykumar and Thottethodi (4)
1

Defining Performance
• What is important to whom
Limited model
“Start job, wait 1. Computer system user for end.” Suggest
– minimize elapsed time for program = alternate
time_end – time_start – called response time
performance metrics.
2. Computer center manager
– maximize completion rate = #jobs/second – called throughput
ECE437, S’21 © Vijaykumar and Thottethodi (5)
Response Time vs. Throughput
• Is throughput = 1/av. response time?
– ONLY if NO overlap
– with overlap, throughput > 1/av.response time
• e.g., a lunch buffet – assume 5 entrees
– each person takes 2 minutes at every entree
• throughput is 1 person every 2 minutes
• BUT time to fill up tray is 10 minutes
– Why? Otherwise, what would the throughput be?
• because there are 5 people (each at 1 entree) simultaneously;
• if there is no such overlap throughput = 1/10
ECE437, S’21 © Vijaykumar and Thottethodi (6)
What is Performance for us?
• For computer architects
– CPU execution time = time spent running a program
• Shorter time better but people like better to be bigger to match intuition
– performance = 1/time (shorter time better perf.)
– where time = response time, CPU run time, etc.
• Elapsed time = CPU execution time + I/O wait
• We will concentrate mostly on CPU execution time
ECE437, S’21 © Vijaykumar and Thottethodi (7)
Improve Performance
• Improve (a) response time or (b) throughput?
– faster CPU
• both (a) and (b)
– Add more CPUs
• (b) but (a) may be improved due to reduced queueing
ECE437, S’21 © Vijaykumar and Thottethodi (8)
Give an example of this phenomenon
2

Performance Comparison
• Machine A is n times faster than machine B iff
– perf(A)/perf(B) = time(B)/time(A) = n
• Machine A is x% faster than machine B iff – perf(A)/perf(B) = time(B)/time(A) = 1 + x/100
• E.g., A 10s, B 15s
– 15/10 = 1.5 => A is 1.5 times faster than B
– 15/10=1+50/100=>Ais50%fasterthanB
ECE437, S’21 © Vijaykumar and Thottethodi (9)
Breaking down Performance
• A program is broken into instructions
– H/W is aware of instructions, not programs
• At lower level, H/W breaks instructions into cycles
– lower level state machines change state every cycle
• E.g., 4 GHz Opteron runs 4 B cycles/sec, 1 cycle = 0.25 ns
ECE437, S’21 © Vijaykumar and Thottethodi (10)
Iron law
• Time/program = instrs/program x cycles/instr x sec/cycle
– NEVER forget this!
• sec/cycle (a.k.a. cycle time, clock time)
– mostly determined by technology and CPU orgn. • cycles/instr (a.k.a. CPI)
– mostly determined by ISA and CPU organization – EFFECTIVE cycles/instr and NOT actual latency – overlap among instructions makes this smaller
• Each instr 5 cycles but 5 instrs overlap  CPI = 1
– AVERAGE over instrs (instrs have different CPI)
• instr/program (a.k.a. instruction count)
– instrs executed, NOT static code
– mostly determined by program, compiler, ISA
ECE437, S’21 © Vijaykumar and Thottethodi (11)
Our Goal
• Minimize time which is the product, NOT isolated terms
– Tempting because of the separate terms
– Optimizing one term may worsen time by worsening other terms!
• Common error to miss terms while devising optimizations
– E.g., ISA change to decrease instr. count – BUT leads to slower clock
ECE437, S’21 © Vijaykumar and Thottethodi (12)
3

Iron Law Example
• Machine A: clock 1 ns, CPI 2.0, for a program
• Machine B: clock 2 ns, CPI 1.2, for same program
• Which is faster and how much?
• Time/program = instrs/program x cycles/instr x sec/cycle
– Time(A):Nx2.0x1=2N
– Time(B):Nx1.2×2=2.4N
• Compare: Time(B)/Time(A) = 2.4N/2N = 1.2
• So, Machine A is 20% faster than Machine B for this program
ECE437, S’21 © Vijaykumar and Thottethodi (13)
Iron Law Example
• KeepclockofAat1nsandclockofBat 2 ns
• For equal performance, if CPI of B is 1.2, what is A’s CPI?
– Time(B)/Time(A) = 1 = (N x 2 x 1.2)/(N x 1 x CPI(A))
– CPI(A) = 2.4
ECE437, S’21 © Vijaykumar and Thottethodi (14)
Iron Law Example
• LetCPIofA= 2.0andCPIofB=1.2
• For equal performance, if clock of B is 2 ns, what is A’s clock?
ECE437, S’21 © Vijaykumar and Thottethodi (15)
Iron Law Example
• LetCPIofA=2.0andCPIofB=1.2
• For equal performance, if clock of B is 2 ns, what is A’s clock?
– Time(B)/Time(A) = 1 = (N x 1.2 x 2)/(N x 2.0 x clock(A))
– clock(A) = 1.2 ns
ECE437, S’21 © Vijaykumar and Thottethodi (16)
4

Iron Law notes
• You will see ch4a (single cycle) in the lab
• But not Ch1b (Iron law) so spend an hour to think about and internalize Ch1b
ECE437, S’21 © Vijaykumar and Thottethodi (17)
Other Metrics
• Other than execution time
• MIPS and MFLOPS – Bogus!
• MIPS = millions of instructions per sec
= instruction count/(execution time x 106) – Not MIPS, the instruction set
= clock rate/(CPI x 106) (How?) • BUT MIPS has problems
ECE437, S’21 © Vijaykumar and Thottethodi (18)
Problems with MIPS
• E.g.,withoutFPH/W,anFPopmaytake50single- cycle instrs
• withFPH/Wonlyone2-cycleinstratsameclock
• ThusaddingFPH/W
– CPIincreases(why?)TheFPopgoesfrom50/50to2/1 – butinstrs/progdecreasesmore(why?)eachoftheFPop
reduces from 50 to 1, factor of 50
– totalexecutiontimedecreases
• ForMIPS
– instrs/prog ignored
• MIPSgetsworse!
ECE437, S’21 © Vijaykumar and Thottethodi (19)
Problems with MIPS
• Ignores program
• Usually used to quote peak performance
– ideal conditions => guarantee not to exceed!! • When is MIPS ok?
– same compiler and same ISA
– e.g., same binary running on Xeon Ivy Bridge and
Xeon Broadwell
– why? instrs/prog is constant and may be ignored
ECE437, S’21 © Vijaykumar and Thottethodi (20)
5

Other (Bogus) Metrics
• MFLOPS = FP ops in program/(execution time x 106)
• Assuming FP ops independent of compiler and ISA
– Assumption not true
– Eg may not have divide instruction in ISA
– optimizing compilers can remove some insts
• Relative MIPS and normalized MFLOPS – adds to confusion!
ECE437, S’21 © Vijaykumar and Thottethodi (21)
Rules
• Use ONLY Time
– Beware when reading, especially if details are omitted
– Beware of Peak
• Peak is the performance that the chip maker guarantees not to exceed – meaningless!
ECE437, S’21 © Vijaykumar and Thottethodi (22)
Outline
• Time and performance • Iron law
• MIPS and MFLOPS
• Which programs
• How to average • Amdahl’s law
ECE437, S’21 © Vijaykumar and Thottethodi (23)
Which Programs
• Execution time of what
• Best case – you run the same set of programs everyday
– port them and time the whole “workload” • In reality, use benchmarks
– programs chosen to measure performance
– predict performance of actual workload (hopefully) – saves effort and money
– representative? honest?
ECE437, S’21 © Vijaykumar and Thottethodi (24)
6

Benchmarks: SPEC2006
• SPEC: System Performance Evaluation Cooperative
• Latest is SPEC2006, before SPEC89, SPEC92, SPEC95, SPEC2000
• 12 integer and 17 floating point programs
– normalize run time with a Gold Standard processor (SPEC ratio)
– Geometric Mean (GM) of the normalized times (why GM?)
ECE437, S’21 © Vijaykumar and Thottethodi (25)
One-minute quiz
• State the Iron Law – Time per program =
• Previous quiz
– What is the ALU operation for loads and stores?
• Add
ECE437, S’21 © Vijaykumar and Thottethodi (26)
SPEC CINT2006 http://www.spec.org/cpu2006/CINT2006/
Benchmark
Description
Xalancbmk
XML processing
astar
Path finding
mcf
Combinatorial optimization
gcc
GNU C compiler
Omnetpp
Discrete event simulation
h264ref
Video compression
libquantum
Quantum computing
gobmk
Artificial Intelligence: Go
hmmer
Gene Sequence Search
Bzip2
Compression
sjeng
Artificial Intelligence: chess
Perlbench
PERL programming language
ECE437, S’21 © Vijaykumar and Thottethodi (27)
SPEC CFP2006 http://www.spec.org/cpu2006/CFP2006/
Benchmark
Description
milc
Quantum chromodynamics
Bwaves
Fluid dynamics
Soplex
Simplex LP solver
Wrf
Weather modeling
17 in all, see webpage
Speech recognition, quantum chemistry, structural mechanics, ray tracing, etc.
ECE437, S’21 © Vijaykumar and Thottethodi (28)
7

How to average
• SPEC has 29 programs – how to average? • Example
• One answer: total execution time, then how much faster than A is B? 1001/110 = 9.1
ECE437, S’21 © Vijaykumar and (29) Thottethodi
Machine A
Machine B
Program 1
1s
10s
Program 2
1000s
100s
Total
1001s
110s
How to average
• Another:arithmeticmean(sameresult:B9.1times
faster than A )
• Arithmetic mean of times:  time
programs
 i1
i 
• AM(A)=1001/2=500.5
• AM(B)=110/2=55
• 500.5/55=9.1
• Validonlyifprogramsrunequallyoften,elseuse
“weight” factors
• Weighted arithmetic mean:
ECE437, S’21 © Vijaykumar and (30) Thottethodi
 i1 
n
/ n
for n
 n
 weight  time  / n
ii
Other Averages • E.g., 30 mph for first 10 miles
• 90 mph for next 10 miles. Average speed?
• Average speed = (30+90)/2 =60mph? WRONG
• Average speed = total distance / total time = (20 / (10/30+10/90))
= 45 mph
• What if it was 10 hours at each speed?
– instead of 10 miles
ECE437, S’21 © Vijaykumar and Thottethodi (31)
Harmonic Mean
• Harmonic mean of rates =
1 n1
rate  i1 i n
– Use HM if forced to start and end with rates • Trick to do arithmetic mean of times but
using rates and not times
ECE437, S’21 © Vijaykumar and (32) Thottethodi
8

Practice
• Machine A runs 10M instructions at 15 MIPS and the next 10M instructions at 45 MIPS
– Average MIPS = ??
• Machine A runs for 10 seconds at 15 MIPS and the next 10 seconds at 45 MIPS
– Average MIPS = ??
ECE437, S’21 © Vijaykumar and Thottethodi (33)
Dealing with Ratios
• Absolute times
• Now consider ratios (w.r.t. A)
• Averages: A = 1, B = 5.05 ECE437, S’21 © Vijaykumar and Thottethodi (34)
Machine A
Machine B
Program 1
1s
10s
Program 2
1000s
100s
Machine A
Machine B
Program 1
1
10
Program 2
1
0.1
Dealing with Ratios
• Absolute times
• Now consider ratios (w.r.t. B)
• Averages: A = 5.05, B = 1
• Both cannot be true
ECE437, S’21 © Vijaykumar and Thottethodi (35)
Machine A
Machine B
Program 1
1s
10s
Program 2
1000s
100s
Machine A
Machine B
Program 1
0.1
1
Program 2
10
1
Geometric Mean
• Don’tusearithmeticmeanonratios(normalized numbers)
• Usegeometricmeanforratios
– geometric mean of ratios =
– Use GM if forced to use ratios
• Independentofreferencemachine(mathproperty)
• Intheexample,GMformachineAis1,formachine
B is also 1
• normalizedwithrespecttoeithermachine
ECE437, S’21 © Vijaykumar and (36) Thottethodi
n
n  ratio
i i1
9

But..
• Geometric mean of ratios is not proportional to total time
• AM in example says machine B is 9.1 times faster
• GM says they are equal
• If we took total execution time, A and B are equal only if
– program 1 is run 100 times more often than program 2
• Generally, GM will mispredict for three or more machines
ECE437, S’21 © Vijaykumar and Thottethodi (37)
Previous Midterm Question
• Machine A runs 9 times faster than machine B when running program-P,
• Machine B runs 4 times faster than machine A when running program Q.
• Which machine is faster (and by what factor) when averaged across both programs?
ECE437, S’21 © Vijaykumar and Thottethodi (38)
Summary for Averages
• Use AM for times
• Use HM if forced to use rates
• Use GM if forced to use ratios
• Better yet, use unnormalized, raw run times to compute performance
ECE437, S’21 © Vijaykumar and Thottethodi (39)
Amdahl’s Law
• Whydoesthecommoncasematterthemost?
• Letanoptimizationspeedffractionoftimebya
factor of s
• assumingthatoldtime=T,whatisthespeedup?
– fisthe“affected”fractionofT – (1-f) is the unaffected fraction
• Speedup = timeold  unaffectedold  affectedold timenew unaffectednew affectednew
• = (1f)TfT  ( 1  f )  T  sf  T
ECE437, S’21 © Vijaykumar and Thottethodi
1
( 1  f )  sf
(40)
10

Amdahl’s Law Example
• Your boss asks you to improve processor performance
• Two options: What should you do?
– improve the ALU used 95% of time, by 10%
– improve the square-root unit used 5%, by a factor of 10
ECE437, S’21 © Vijaykumar and (41) Thottethodi
f
s
Speedup
95%
1.10
1.094
5%
10
1.047
5%

1.052
Amdahl’s Law: Limit
• Make common case fast because:
10
8
11 6 lim  
s 1ff s 1f 4 2 0
ECE437, S’21 © Vijaykumar and (42) Thottethodi
0 0.2 0.4 0.6 0.8 1 f
Amdahl’s Law
• “Make common case fast”
– Scientific heuristic, not religious commandment – Use for intuition, verify with numbers
• 60% can be improved by a factor of 2 – Speedup = 1/(0.4+0.6/2) = 1/0.7
• 40% can be improved by a factor of 8 – Speedup = 1/(0.6+0.4/8) = 1/0.65
• Second option is better
– Less common case, but higher speedup compensates
ECE437, S’21 © Vijaykumar and Thottethodi (43)
Summary of Chapter 1b
• Timeandperformance:
– MachineAntimesfasterthanMachineB – iff Time(B)/Time(A) = n
• IronLaw:Time/prog
– InstrcountxCPIxCycletime
• OtherMetrics:MIPSandMFLOPS – Bewareofpeakandomitteddetails
• Benchmarks:SPEC2006(int+FP) • Summarizeperformance:
– AMfortime,HMforrate,GMforratio
• Amdahl’s Law: Speedup =  1  common case fast
1ff s 
ECE437, S’21 © Vijaykumar and (44) Thottethodi
11
Speedup

Single-cycle Datapath
CPI
Inst. Count
Cycle Time
• Performance Implications
– Minimize all three
– Insts/prog fixed — f(ISA,compiler)
– CPI = 1 : As good as it gets (*)
– Clock cycle time : high, “lw” critical path
ECE437, S’21 © Vijaykumar and Thottethodi (45)
Back to Ch4b
• To improve performance beyond single- cycle datapath
– Now that we understand performance
ECE437, S’21 © Vijaykumar and Thottethodi (46)
12