COMP2100/COMP6442
Benchmarking & Performance
– Lecture 8]
Kin Chau [
Sid Chi
1
Benchmarking
• Benchmarking
• Compare the performance of different algorithms and systems
• Evaluate practical performance with real-world input data
• Optimize best practice, improve implementation and plan resource allocation • Collect and analyze practical performance data
• Provide assurance and confidence before practical deployment
• Benchmarking trials
• Construct a suite of independent trials for which the algorithm is executed
• Trials are executed and milli/nanosecond-level timings are taken before and after the algorithm is executed
• Eliminate inconsistent measurements
2
Benchmarking
• Performance measurements may be different in a different time, even with same code and implementation
• Computer background processes may affect practical performance • Eliminate outliner performance data
• In Java, the system garbage collector may affect the performance
• The system garbage collector is invoked immediately prior to the trial
• Call System.gc()
• Although this cannot guarantee that the garbage collector does not execute during the trial, it may reduce the impact
3
Benchmarking in Java
• Example of benchmarking simple summation
public class Trial {
public static void main (String[] args) {
for (long len = 1000000; len <= 5000000; len += 1000000) { for (int i = 0; i < 30; i++) {
System.gc(); //Invoke garbage collector long start = System.currentTimeMillis();
// Simple summation to be timed
long sum = 0;
for (int x = 1; x <= len; x++) { sum += x; }
long end = System.currentTimeMillis();
// Output runtime
System.out.println(“trial:” + len + “ runtime:” + (end - start));
}
}
} }
4
Benchmarking in Java
• Instead of millisecond-level timers, nanosecond timers could be used • In Java, invoke System.nanoTime()
for (long len = 1000000; len <= 5000000; len += 1000000) { for (int i = 0; i < 30; i++) {
System.gc(); //Invoke garbage collector
long start = System.nanoTime(); //Nanosecond timer
// Simple summation to be timed
long sum = 0;
for (int x = 1; x <= len; x++) { sum += x; }
long end = System.nanoTime();
// Output runtime
System.out.println(“trial:” + len + “ runtime:” + (end - start));
}
}
5
Benchmarking Data Structures
• Benchmark binary search tree, red-black tree and AVL tree
• Consider worst-case (i.e., highly unbalanced tree) and average-case
(i.e., random input sequences)
• Which one of the tree data structures is the best practically?
• BST is surprisingly efficient. Why?
Source: https://codedeposit.wordpress.com/2015/10/07/red-black-vs-avl/
6
Aspects of Performance Analysis
• Does your software perform as what you expect? • Does it complete fast?
• Does it work well with more inputs? • Does it break?
• Metrics of performance:
• Latency, throughput, memory size • Network bandwidth
• Concurrency
• Performance analysis • Best case
• Average case • Worst case
7
Performance Profiling
• Tools that provide a visual interface for detailed information about the runtime operations of a program
• Understand how your program utilizes different computing resources • e.g. memory, CPU, GPU, hard disk, network
• Identify the bottleneck of your program • Optimize program implementation
• Locate potential bugs in your program
• Example:
• JConsole
• VisualVM
• Eclipse Trace Compass
8
VisualVM
9
Trace Compass
10
Performance Evaluation by Simulation
• How to obtain practical performance evaluation of algorithms and systems considering realistic inputs?
• Real-world deployment
• Setup a small-scale deployment for performance evaluation • Expensive or only small-scale; Cannot obtain prior insights
• Modeling and analysis
• Mathematical reasoning of performance
• Only apply to simple systems; modeling needs simplifying assumptions
• Simulation
• Generate artificial inputs to estimate real-world performance • A balance between realism and efficiency
11
Performance Evaluation by Simulation
• Simulation is cost-effective and does not require many simplifying assumptions, which is a viable approach for performance evaluation
• Dynamic systems
• Systems (which are controlled by certain algorithms) change with time and
respond to random inputs, e.g., a system playing Tetris
• Sample path is the evolution of states in a dynamic system
• Computational construction of sample paths is a major part of simulation
• Discrete-event simulations
• Some sample paths are characterized by finite events
• Construct a random generation model for the discrete events
12
Motivating Question of Performance
• You have a program X with two component parts A and B
• Each of which takes 10 minutes. What is the latency of X?
• Latency is the time from the beginning to the end to complete a job
• Suppose that you can speedup part B by a factor of 5 • What is the latency now?
• What is the overall speedup?
• If A and B are sequential, then Amdahl’s Law provides an answer
CPU Processing Time
AB
13
Amdahl’s Law
• How much extra performance can you have if you speed up some part of your program?
• Notations:
• S is the overall performance gain • k is the speed-up factor
• α is the portion of speed-up
Unimproved part Improved part
•𝑇 =1−𝛼𝑇+𝛼𝑇𝑜𝑙𝑑=𝑇(1−𝛼+𝛼)
𝑛𝑒𝑤 𝑜𝑙𝑑𝑘𝑜𝑙𝑑 𝑘
•𝑆=𝑇𝑜𝑙𝑑= 1
𝑇 1−𝛼 +𝛼 𝑛𝑒𝑤 𝑘
14
Example
• Your program has one very slow procedure that consumes 70% of the total time. Next, you improve it by a factor of 2
• What is the performance gain in the overall latency? • α = 0.7 (70%)
• k= 2
•𝑆=𝑇𝑜𝑙𝑑= 1 = 1 =1.538 𝑇 1−𝛼 +𝛼 1−0.7 +0.7
𝑛𝑒𝑤 𝑘 2
CPU Processing Time
AB
15
Example
• Floating point instructions could be improved by 2x. Only 15% of instructions are floating point
• What is the performance gain in the overall latency? • α = 0.15 (15%)
• k= 2
•𝑆=𝑇𝑜𝑙𝑑= 1 = 1 =1.081 𝑇 1−𝛼 +𝛼 0.15
𝑛𝑒𝑤 𝑘 1−0.15 + 2
CPU Processing Time
AB
16
Amdahl’s Law
17
Application to Parallel Processing
• Divide the program into sequential part, 1-P, and parallel part, P • Assume there are N processors then the improvement of the
parallelizable part is N
• Based on Amdahl’s law, the performance gain from N processors is:
1
•𝑆=
𝑁
1−𝑃 +𝑃
Sequential Parallelizable
1-P P CPU 1 P/N
...
CPU N P/N
18
Limit as N
• The performance gain from a very large number of processors is •𝑆=lim 1 =1
→∞
𝑁→∞ 1−𝑃 +𝑃 1−𝑃 𝑁
• Fundamental limitation of performance gain from parallelization (diminishing returns); adding more CPUs may not improve performance
• Neglects other potential bottlenecks, e.g., memory bandwidth and I/O bandwidth Sequential Parallelizable
1-P P
CPU 1 ...
CPU ∞
19
Reference
• Amdahl’s Law paper: “Validity of the single processor approach to achieving large scale computing capabilities”
• https://inst.eecs.berkeley.edu/~n252/paper/Amdahl.pdf
• Java and Parallel Programming
• https://www.oracle.com/technical-resources/articles/java/fork-join.html
20