CS152: Computer Architecture and Engineering
CIS 501: Comp. Arch. | Prof. | Performance
CIS 501: Computer Architecture
Copyright By PowCoder代写 加微信 powcoder
Unit 5: Performance & Benchmarking
Slides developed by , & at Upenn
with sources that included University of Wisconsin slides
by , , , and
CIS 501: Comp. Arch. | Prof. | Performance
CPU Performance
Comparing Performance
Benchmarks
Performance Laws
Performance Metrics
CIS 501: Comp. Arch. | Prof. | Performance
CIS 501: Comp. Arch. | Prof. | Performance
Performance: Latency vs. Throughput
Latency (execution time): time to finish a fixed task
Throughput (bandwidth): number of tasks per unit time
Different: exploit parallelism for throughput, not latency (e.g., bread)
Often contradictory (latency vs. throughput)
Will see many examples of this
Choose definition of performance that matches your goals
Scientific program? latency. web server? throughput.
Example: move people 10 miles
Car: capacity = 5, speed = 60 miles/hour
Bus: capacity = 60, speed = 20 miles/hour
Latency: car = 10 min, bus = 30 min
Throughput: car = 15 PPH (count return trip), bus = 60 PPH
Fastest way to send 10TB of data? (1+ gbits/second)
Answer: Used FedEx overnight to deliver the drive. 10 TB in, say, 15 hours is 1,300 mbits/second of bandwidth (600 GB/hour)
Amazon Does This!
CIS 501: Comp. Arch. | Prof. | Performance
“Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.”
Computer Networks, 4th ed., p. 91
10^5 DVDs, 10GB each, 10^5 seconds to drive across the country = 80Gb/s
CPU Performance
CIS 501: Comp. Arch. | Prof. | Performance
CIS 501: Comp. Arch. | Prof. | Performance
Cycles per Instruction (CPI)
CPI: Cycle/instruction on average
IPC = 1/CPI
Used more frequently than CPI
Favored because “bigger is better”, but harder to compute with
Different instructions have different cycle costs
E.g., “add” typically takes 1 cycle, “divide” takes >10 cycles
Depends on relative instruction frequencies
CPI example
A program executes equal: integer, floating point (FP), memory ops
Cycles per instruction type: integer = 1, memory = 2, FP = 3
What is the CPI? (33% * 1) + (33% * 2) + (33% * 3) = 2
Caveat: this sort of calculation ignores many effects
Back-of-the-envelope arguments only
Q: what’s something that this CPI calc ignores?
CIS 501: Comp. Arch. | Prof. | Performance
CPI Example
Assume a processor with instruction frequencies and costs
Integer ALU: 50%, 1 cycle
Load: 20%, 5 cycle
Store: 10%, 1 cycle
Branch: 20%, 2 cycle
Which change would improve performance more?
A. “Branch prediction” to reduce branch cost to 1 cycle?
B. Faster data memory to reduce load cost to 3 cycles?
Compute CPI
Base = 0.5*1 + 0.2*5 + 0.1*1 + 0.2*2 = 2 CPI
A = 0.5*1 + 0.2*5 + 0.1*1+ 0.2*1 = 1.8 CPI
B = 0.5*1 + 0.2*3 + 0.1*1 + 0.2*2 = 1.6 CPI
A is 1.11x faster than base
B is 1.25x faster than base
CIS 501: Comp. Arch. | Prof. | Performance
Measuring CPI
How are CPI and execution-time actually measured?
Execution time? stopwatch timer (Unix “time” command)
CPI = (CPU time * clock frequency) / dynamic insn count
How is dynamic instruction count measured?
More useful is CPI breakdown (CPICPU, CPIMEM, etc.)
So we know what performance problems are and what to fix
Hardware event counters
Available in most processors today
One way to measure dynamic instruction count
Calculate CPI using counter frequencies / known event costs
Cycle-level micro-architecture simulation
Measure exactly what you want … and impact of potential fixes!
Method of choice for many micro-architects
CIS 501: Comp. Arch. | Prof. | Performance
Frequency as a performance metric
1 Hertz = 1 cycle per second
1 Ghz is 1 cycle per nanosecond, 1 Ghz = 1000 Mhz
(Micro-)architects often ignore dynamic instruction count…
… but general public (mostly) also ignores CPI
and instead equate clock frequency with performance!
Which processor would you buy?
Processor A: CPI = 2, clock = 5 GHz
Processor B: CPI = 1, clock = 3 GHz
Probably A, but B is faster (assuming same ISA/compiler)
Classic example
Core i7 faster clock-per-clock than Core 2
Same ISA and compiler!
partial performance metrics are dangerous!
CIS 501: Comp. Arch. | Prof. | Performance
MIPS (performance metric, not the ISA)
(Micro) architects often ignore dynamic instruction count
Typically work in one ISA/one compiler treat it as fixed
CPU performance equation becomes
Latency: seconds / insn = (cycles / insn) * (seconds / cycle)
Throughput: insn / second = (insn / cycle) * (cycles / second)
MIPS (millions of instructions per second)
Cycles / second: clock frequency (in MHz)
Example: CPI = 2, clock = 500 MHz 0.5 * 500 MHz = 250 MIPS
Pitfall: may vary inversely with actual performance
Compiler removes insns, program gets faster, MIPS goes down
Work per instruction varies (e.g., multiply vs. add, FP vs. integer)
– Ie RISC vs CISC ins
Comparing Performance
CIS 501: Comp. Arch. | Prof. | Performance
CIS 501: Comp. Arch. | Prof. | Performance
Comparing Performance – Speedup
Speedup of A over B
X = Latency(B)/Latency(A) (divide by the faster)
X = Throughput(A)/Throughput(B) (divide by the slower)
A is X% faster than B if
X = ((Latency(B)/Latency(A)) – 1) * 100
X = ((Throughput(A)/Throughput(B)) – 1) * 100
Latency(A) = Latency(B) / (1+(X/100))
Throughput(A) = Throughput(B) * (1+(X/100))
Car/bus example
Latency? Car is 3 times (and 200%) faster than bus
Throughput? Bus is 4 times (and 300%) faster than car
Speedup and % Increase and Decrease
Program A runs for 200 cycles
Program B runs for 350 cycles
Percent increase and decrease are not the same.
% increase: ((350 – 200)/200) * 100 = 75%
% decrease: ((350 – 200)/350) * 100 = 42.3%
350/200 = 1.75 – Program A is 1.75x faster than program B
As a percentage: (1.75 – 1) * 100 = 75%
If program C is 1x faster than A, how many cycles does C run for? – 200 (the same as A)
What if C is 1.5x faster? 133 cycles (50% faster than A)
CIS 501: Comp. Arch. | Prof. | Performance
CIS 501: Comp. Arch. | Prof. | Performance
Mean (Average) Performance Numbers
Arithmetic: (1/N) * ∑P=1..N P_latency
For units that are proportional to time (e.g., latency)
Harmonic: N / ∑P=1..N 1/P_throughput
For units that are inversely proportional to time (e.g., throughput)
You can add latencies, but not throughputs
Latency(P1+P2,A) = Latency(P1,A) + Latency(P2,A)
Throughput(P1+P2,A) != Throughput(P1,A) + Throughput(P2,A)
1 mile @ 30 miles/hour + 1 mile @ 90 miles/hour
Average is not 60 miles/hour
Geometric: N√∏P=1..N P_speedup
For unitless quantities (e.g., speedup ratios)
For Example…
You drive two miles
30 miles per hour for the first mile
90 miles per hour for the second mile
Question: what was your average speed?
Hint: the answer is not 60 miles per hour
CIS 501: Comp. Arch. | Prof. | Performance
You drive two miles
30 miles per hour for the first mile
90 miles per hour for the second mile
Question: what was your average speed?
Hint: the answer is not 60 miles per hour
0.03333 hours per mile for 1 mile
0.01111 hours per mile for 1 mile
0.02222 hours per mile on average
= 45 miles per hour
CIS 501: Comp. Arch. | Prof. | Performance
Q: avg if we drove 30mph for 1 hour and 60mph for 1 hour?
Measurement Challenges
CIS 501: Comp. Arch. | Prof. | Performance
Measurement Challenges
Are –O3 compiler optimizations really faster than –O0?
Why might they not be?
other processes running
not enough runs
not using a high-resolution timer
cold-start effects
managed languages: JIT/GC/VM startup
solution: experiment design + statistics
CIS 501: Comp. Arch. | Prof. | Performance
Experiment Design
Two kinds of errors: systematic and random
removing systematic error
aka “measurement bias” or “not measuring what you think you are”
Run on an unloaded system
Measure something that runs for at least several seconds
Understand the system being measured
simple empty-for-loop test => compiler optimizes it away
Vary experimental setup
Use appropriate statistics
removing random error
Perform many runs: how many is enough?
CIS 501: Comp. Arch. | Prof. | Performance
what problems does HW2 data collection have?
Determining performance differences
Program runs in 20s on machine A, 20.1s on machine B
Is this a meaningful difference?
CIS 501: Comp. Arch. | Prof. | Performance
execution time
the distribution matters!
We can ask about average perf, 90% perf, etc.
Confidence Intervals
Compute mean and confidence interval (CI)
Meaning of the 95% confidence interval x ± 1.3
collected 1 sample with n experiments
given repeated sampling, x will be within 1.3 of the true mean 95% of the time
If CIs overlap, differences not statistically significant
CIS 501: Comp. Arch. | Prof. | Performance
t = critical value from t-distribution
s = sample standard error
n = # experiments in sample
t depends on CI level
CI example
130 experiments, mean = 45.4s, stderr = 10.1s
What’s the 95% CI?
t = 1.962 (depends on %CI and # experiments)
look it up in a stats textbook or online
at 95% CI, performance is 45.4 ±1.74 seconds
What if we want a smaller CI?
CIS 501: Comp. Arch. | Prof. | Performance
Benchmarking
CIS 501: Comp. Arch. | Prof. | Performance
metrics, stats, now need workloads
CIS 501: Comp. Arch. | Prof. | Performance
Processor Performance and Workloads
Q: what does performance of a chip mean?
A: nothing, there must be some associated workload
Workload: set of tasks someone (you) cares about
Benchmarks: standard workloads
Used to compare performance across machines
Either are, or highly representative of, actual programs people run
Micro-benchmarks: non-standard non-workloads
Tiny programs used to isolate certain aspects of performance
Not representative of complex behaviors of real applications
Examples: binary tree search, towers-of-hanoi, 8-queens, etc.
-mention different workloads used for different purposes
CIS 501: Comp. Arch. | Prof. | Performance
Example: SPECmark 2006/2017
performance wrt reference machine
Latency SPECmark
For each benchmark
Take odd number of samples
Choose median
Take speedup (reference machine / your machine)
Take “average” (Geometric mean) of speedups over all benchmarks
Throughput SPECmark
Run multiple benchmarks in parallel on multiple-processor system
Q: how does this account for run-to-run variability?
Example: GeekBench
Set of cross-platform multicore benchmarks
Can run on iPhone, Android, laptop, desktop, etc
Tests integer, floating point, memory bandwidth performance
GeekBench stores all results online
Easy to check scores for many different systems, processors
Pitfall: Workloads are simple, may not be a completely accurate representation of performance
We know they evaluate compared to a baseline benchmark
CIS 501: Comp. Arch. | Prof. | Performance
geekbench workloads: GEMM, FFT, encryption, file/image (de)compress
Example: GTA V
CIS 501: Comp. Arch. | Prof. | Performance
http://www.anandtech.com/show/9306/the-nvidia-geforce-gtx-980-ti-review
GTA V measures latency percentiles internally
sometimes, worst-case latency matters
google experiment: increase latency by 0.5s, 20% reduction in traffic/revenue
http://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html
Performance Laws
CIS 501: Comp. Arch. | Prof. | Performance
Amdahl’s Law
CIS 501: Comp. Arch. | Prof. | Performance
How much will an optimization improve performance?
P = proportion of running time affected by optimization
S = speedup
What if I speedup 25% of a program’s execution by 2x?
1.14x speedup
What if I speedup 25% of a program’s execution by ∞?
1.33x speedup
1-P is the runtime of unoptimized part
P/S is runtime of optimized part
Amdahl’s Law for the US Budget
CIS 501: Comp. Arch. | Prof. | Performance
https://www.whitehouse.gov/omb/historical-tables/
scrapping Dept of Education ($111B) cuts budget by 2.7%
HHS includes Medicare/Medicaid
even Defense is ~13.5%
US Federal Gov’t Expenses 2017
Department of Education Total 111702 Department of Agriculture Total 127558 Department of Veterans Affairs Total 176050 Department of the Treasury Total 546397 all others Total 558381 Department of Defense–Military Programs Total 568896 Social Security Administration Total 1000813 Department of Health and Human Services Total 1116779 $B
Amdahl’s Law for Parallelization
CIS 501: Comp. Arch. | Prof. | Performance
How much will parallelization improve performance?
P = proportion of parallel code
N = threads
What is the max speedup for a program that’s 10% serial?
What about 1% serial?
10% serial – 10x
1% serial – 100x
Increasing proportion of parallel code
Amdahl’s Law requires extremely parallel code to take advantage of large multiprocessors
two approaches:
strong scaling: shrink the serial component
same problem runs faster
becomes harder and harder to do
weak scaling: increase the problem size
natural in many problem domains: internet systems, scientific computing, video games
doesn’t work in other domains
CIS 501: Comp. Arch. | Prof. | Performance
Use Little’s Law!
CIS 501: Comp. Arch. | Prof. | Performance
How long am I going to be in this line?
Little’s Law
Assumption:
system is in steady state, i.e., average arrival rate = average departure rate
No assumptions about:
arrival/departure/wait time distribution or service order (FIFO, LIFO, etc.)
Works on any queuing system
Works on systems of systems
CIS 501: Comp. Arch. | Prof. | Performance
L = items in the system
λ = average arrival rate
W = average wait time
Little’s Law for Computing Systems
Only need to measure two of L, λ and W
often difficult to measure L directly
Describes how to meet performance requirements
e.g., to get high throughput (λ), we need either:
low latency per request (small W)
service requests in parallel (large L)
Addresses many computer performance questions
sizing queue of L1, L2, L3 misses
sizing queue of outstanding network requests for 1 machine
or the whole datacenter
calculating average latency for a design
CIS 501: Comp. Arch. | Prof. | Performance
CIS 501: Comp. Arch. | Prof. | Performance
Performance Rules of Thumb
Design for actual performance, not peak performance
Peak performance: “Performance you are guaranteed not to exceed”
Greater than “actual” or “average” or “sustained” performance
Why? Caches misses, branch mispredictions, limited ILP, etc.
For actual performance X, machine capability must be > X
Easier to “buy” bandwidth than latency
say we want to transport more cargo via train:
(1) build another track or (2) make a train that goes twice as fast?
Use bandwidth to reduce latency
Build a balanced system
Don’t over-optimize 1% to the detriment of other 99%
System performance often determined by slowest component
System performance determined by “worst mistake” not by making everything perfect
– Build another track
Measuring LC-4 Processor Performance
CIS 501: Comp. Arch. | Prof. | Performance
– Possibly move benchmarks before
Benchmarking the LC-4 processors
Fixed workload: wireframe trace
Focus on improving frequency with pipelining
measure frequency with Vivado timing reports
Focus on improving IPC with superscalar
see how many cycles the wireframe trace takes
CIS 501: Comp. Arch. | Prof. | Performance
show results from Vivado timing report
Latency = seconds / program =
(instructions / program) * (cycles / instruction) * (seconds / cycle)
Instructions / program: dynamic instruction count
Function of program, compiler, instruction set architecture (ISA)
Cycles / instruction: CPI
Function of program, compiler, ISA, micro-architecture
Seconds / cycle: clock period
Function of micro-architecture, technology parameters
Optimize each component
this class focuses mostly on CPI (caches, parallelism)
…but some on dynamic instruction count (compiler, ISA)
…and some on clock frequency (pipelining, technology)
CIS 501: Comp. Arch. | Prof. | Performance
/docProps/thumbnail.jpeg
程序代写 CS代考 加微信: powcoder QQ: 1823890830 Email: powcoder@163.com