High Performance Computing
Performance
Dr Ligang He
Four approaches to modelling the performance of a parallel application
Modelling execution time
Amdahl’s law
Asymptotic analysis
Metrics to measure the parallelisation quality of parallel programs
Parallel efficiency
We want to know the improvement (or not) brought about by parallelising an application code.
The improvement can be measured by speedup.
• In its simplest form, speedup is the ratio of the execution time of a serial implementation to the execution time of a parallel implementation.
• If p processors are used, then S(p) = t1/tp.
• In practice, we run the serial program and the parallel program multiple times and use the average execution time over those runs, as in the sketch below.
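As a minimal sketch (the timing values below are hypothetical measurements, not data from these slides), speedup from averaged runs could be computed as follows:

# Sketch: compute speedup from averaged timings (values are hypothetical).
serial_times = [10.2, 10.4, 10.1]    # seconds, repeated serial runs
parallel_times = [2.7, 2.6, 2.8]     # seconds, repeated parallel runs on p processors

t1 = sum(serial_times) / len(serial_times)
tp = sum(parallel_times) / len(parallel_times)

speedup = t1 / tp
print(f"S(p) = {speedup:.2f}")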
What is “good” speedup?
• Linear speedup is regarded as optimal.
• The maximum speedup for a parallel algorithm with p processors is p.
• The reason is as follows:
  – Consider a serial application whose execution time is t1.
  – The application is split into p processes.
  – Assume there is no overhead such as communication or synchronisation.
  – The least possible execution time of the parallel version is then t1/p.
  – So the maximum speedup is S(p) = t1 / (t1/p) = p.
• The general trend of speedup:
  – At first, speedup increases as the number of processing elements (PEs) increases, but the gap to the maximum (linear) speedup also widens.
  – After the speedup reaches its peak, adding further PEs brings no benefit and may even harm performance.
• Why this trend?
  – It can be seen from the performance model.
  – It can be seen from the surface-to-volume ratio.
Parallel efficiency
Parallel efficiency: E(p) = S(p) / p
Parallel programs are usually not 100% efficient; often S(p) << p. For example, a speedup of 6 on 8 processors gives E = 0.75.
Modelling communication
• The communication time for calculating a subgrid can be computed as
  Tcomm = 2(ts + 2·tw·N·Z)
• Hence, the performance model for the execution time of calculating the velocity of the grid of points is
  TP = Tcomp + Tcomm = tc·N·(N/P)·Z + 2(ts + 2·tw·N·Z)
• The performance model of the execution time tells us a great deal:
  – What happens when we increase P? The execution time decreases with increasing P, but the proportion of communication cost increases (see the sketch below).
  – The execution time increases with increasing N, Z, tc, ts and tw.
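A minimal sketch, assuming illustrative parameter values (not taken from the slides), makes the trend concrete: as P grows, the predicted time falls but communication takes a growing share of it.

# Sketch: evaluate the 1D model TP = tc*N*(N/P)*Z + 2*(ts + 2*tw*N*Z)
# for increasing P. All parameter values are illustrative assumptions.
t_c, t_s, t_w = 1e-8, 1e-5, 1e-7   # per-point compute, message startup, per-word transfer
N, Z = 512, 64                     # grid dimensions

for P in (2, 4, 8, 16, 32, 64):
    t_comp = t_c * N * (N / P) * Z
    t_comm = 2 * (t_s + t_w * 2 * N * Z)   # independent of P in this 1D model
    total = t_comp + t_comm
    print(f"P={P:3d}  T={total:.4f}s  comm share={t_comm/total:6.1%}")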
Iso-efficiency
Iso-efficiency describes how the problem size N must scale with the number of processors P to keep the parallel efficiency E constant.
The function giving N in terms of P is called the algorithm's iso-efficiency function.
The lower the order of P, the more scalable the algorithm.
• An algorithm with an iso-efficiency function of N = O(P) (a linear function) is highly scalable.
  – E.g. if P increases five-fold, N only needs to increase five-fold to maintain efficiency.
• An algorithm with a quadratic or exponential iso-efficiency function is less scalable.
  – E.g. with N = O(P²), a five-fold increase in P requires N to grow 25-fold; with N = O(2^P), even adding 5 more processors multiplies the required N by 2^5 = 32. (A small sketch follows.)
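A tiny sketch (the processor counts are illustrative) of how much N must grow under each iso-efficiency function:

# Sketch: growth in problem size N needed to hold efficiency constant.
P_old, P_new = 4, 20                                  # a five-fold increase in P

print("N = O(P):   N grows x", P_new / P_old)         # x5
print("N = O(P^2): N grows x", (P_new / P_old) ** 2)  # x25
# For an exponential iso-efficiency such as N = O(2^P), even 5 *extra*
# processors multiply the required N by 2**5 = 32.
print("N = O(2^P), P -> P+5: N grows x", 2 ** 5)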
Speedup and parallel efficiency
• The execution time on one processor is T1 = tc·N²·Z
• The speedup is
  S = T1/TP = (P·tc·N²·Z) / (tc·N²·Z + 2·ts·P + 4·tw·N·Z·P)
• The parallel efficiency can be calculated as
  E = S/P = (tc·N²·Z) / (tc·N²·Z + 2·ts·P + 4·tw·N·Z·P)
Iso-efficiency
E = (tc·N²·Z) / (tc·N²·Z + 2·ts·P + 4·tw·N·Z·P)
• Question: what is the iso-efficiency function, i.e. what function of N over P keeps E constant?
• Dividing the numerator and the denominator by N²·Z, E reduces to
  E = tc / (tc + 2·ts·P/(N²·Z) + 4·tw·P/N)
• When N = P, E remains approximately constant as P changes (except when P is small). A numerical check follows.
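A minimal sketch (with illustrative machine parameters, not values from the slides) checking that E stays roughly constant under the choice N = P:

# Sketch: evaluate E = tc*N^2*Z / (tc*N^2*Z + 2*ts*P + 4*tw*N*Z*P) with N = P.
# All parameter values are illustrative assumptions.
t_c, t_s, t_w, Z = 1e-7, 1e-6, 1e-8, 10

for P in (8, 64, 512):
    N = P                               # the iso-efficiency choice N = P
    work = t_c * N**2 * Z
    E = work / (work + 2 * t_s * P + 4 * t_w * N * Z * P)
    print(f"P={P:4d}  E={E:.3f}")       # E rises slightly at small P, then flattens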
2D decomposition (revisit)
• If applying 2D decomposition, the execution time can be modelled as
  Tcomp = tc·N²·Z/P
  Tcomm = 4(ts + 2·tw·(N/√P)·Z)
  TP = Tcomp + Tcomm = (tc·N²·Z + 4·ts·P + 8·tw·N·Z·√P) / P
1D decomposition
Tcomp = tc·N·(N/P)·Z,  Tcomm = 2(ts + 2·tw·N·Z)
• The parallel efficiency of the 2D decomposition can be modelled as
  E = (tc·N²·Z) / (tc·N²·Z + 4·ts·P + 8·tw·N·Z·√P)
Iso-efficiency
• When N = √P, E remains constant as P increases, so the iso-efficiency function in 2D is N = √P.
• The iso-efficiency function in 1D is N = P.
• Therefore, for this particular communication pattern, applying 2D decomposition achieves better scalability than 1D decomposition, as the comparison sketch below illustrates.
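A minimal sketch (again with illustrative parameter values) comparing the two efficiency models for a fixed N as P grows:

# Sketch: 1D vs 2D parallel efficiency for fixed N. Parameters are assumptions.
import math

t_c, t_s, t_w, Z, N = 1e-7, 1e-6, 1e-8, 10, 1024

for P in (4, 16, 64, 256):
    work = t_c * N**2 * Z
    e1 = work / (work + 2 * t_s * P + 4 * t_w * N * Z * P)              # 1D model
    e2 = work / (work + 4 * t_s * P + 8 * t_w * N * Z * math.sqrt(P))   # 2D model
    print(f"P={P:4d}  E_1D={e1:.3f}  E_2D={e2:.3f}")   # 2D degrades far more slowly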
Speedup approach
Using the speedup approach, we can say something like “this algorithm achieved a speedup of S on p processors with problem size N”.
This approach gives some idea of an algorithm's quality, but we cannot judge the quality of an algorithm from a single speedup figure.
This point is elaborated in the following example.
Speedup approach
Consider a sequential algorithm whose optimal execution time is T = N + N², where N is the problem size.
• Parallel Algorithm 1: T = N + N²/p
  – Partitions only the computationally expensive O(N²) component.
  – No other costs.
• Parallel Algorithm 2: T = (N + N²)/p + 100
  – Partitions the whole computation.
  – Introduces a fixed overhead cost of 100.
• Parallel Algorithm 3: T = (N + N²)/p + 0.6p²
  – Partitions the whole computation.
  – Introduces a variable overhead cost of 0.6p².
These algorithms all achieve a speedup of about 10.8 when p = 12 and N = 100, but they diverge from one another as p becomes large, as the sketch below shows.
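A minimal sketch evaluating the three models from the slide; the larger p values are illustrative choices to expose the divergence.

# Sketch: one speedup figure at p=12 hides very different scaling behaviour.
def t_serial(n):             # optimal sequential time T = N + N^2
    return n + n**2

def alg1(n, p): return n + n**2 / p              # partitions only the O(N^2) part
def alg2(n, p): return (n + n**2) / p + 100      # fixed overhead of 100
def alg3(n, p): return (n + n**2) / p + 0.6 * p**2   # overhead growing as 0.6 p^2

N = 100
for p in (12, 100, 1000):
    s = [t_serial(N) / alg(N, p) for alg in (alg1, alg2, alg3)]
    print(f"p={p:5d}  S1={s[0]:7.2f}  S2={s[1]:7.2f}  S3={s[2]:7.2f}")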
Amdahl’s Law
Applications may contain elements that are not amenable to parallelisation. Let this serial fraction be f.
• If we make the remaining part n times faster by running it on n processors, then the time Tn is:
  Tn = (1 - f)·T1/n + f·T1
• Hence, the speedup is:
  S(n) = T1/Tn = n / ((1 - f) + n·f) = 1 / (f + (1 - f)/n) ≤ 1/f
• For example, if an application performs a final (non-parallelisable) collective operation at the end of each iteration which accounts for 8% of the computation time, the maximum achievable speedup is 1/0.08 = 12.5.
• This is Amdahl's Law. (A small sketch follows.)
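A minimal sketch of the formula above; the processor counts are illustrative.

# Sketch: Amdahl's law, S(n) = 1 / (f + (1 - f)/n) for serial fraction f.
def amdahl_speedup(f, n):
    return 1.0 / (f + (1.0 - f) / n)

# The 8% serial example: the bound 1/f = 12.5 is approached but never
# exceeded, no matter how many processors are used.
for n in (10, 100, 10000):
    print(f"n={n:6d}  S={amdahl_speedup(0.08, n):.2f}")
print("limit 1/f =", 1 / 0.08)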
Application of Amdahl’s Law
• Part A takes 75% and part B takes 25% of the whole computation time.
• If we decide to parallelise part B, part A remains serial, so the upper bound of the speedup is 1/0.75 ≈ 1.33.
• If we decide to parallelise part A, part B remains serial, so the upper bound of the speedup is 1/0.25 = 4.
Application of Amdahl’s Law
• Part A takes 75% and part B takes 25% of the whole computation time.
• If we make part B 5 times faster, then
  speedup = 1 / (0.25/5 + 0.75) = 1.25
• If we make part A twice as fast, then
  speedup = 1 / (0.75/2 + 0.25) = 1.6
• Therefore, making A twice as fast is better (and typically much easier) than making B five times faster. (See the sketch below.)
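A minimal sketch of the generalised form used above: if a fraction of the runtime is made k times faster, the overall speedup is 1 / ((1 - frac) + frac/k).

# Sketch: overall speedup when a fraction `frac` of the runtime is made k times faster.
def overall_speedup(frac, k):
    return 1.0 / ((1.0 - frac) + frac / k)

print(overall_speedup(0.25, 5))  # part B (25%) made 5x faster -> 1.25
print(overall_speedup(0.75, 2))  # part A (75%) made 2x faster -> 1.60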
Amdahl’s Law
• Amdahl's law shows us the limitations of parallelising codes.
• Limitations:
  – It can only give the upper bound of the speedup for a particular algorithm.
  – It cannot tell us whether other algorithms with greater parallelism exist for the problem.
Asymptotic analysis
• In this modelling approach, we can say something like “the algorithm takes O(n·log2 n) time on n processors”.
• Limitations:
  – It ignores lower-order terms: e.g. for an algorithm with time complexity O(n log n), the actual running time could be 10n + n log n, and when n is small the 10n term dominates.
  – It only gives the order of growth of a program's execution time, not its actual execution time: e.g. given two algorithms, one taking 1000·n·log n and the other 10n², 1000·n·log n is better than 10n² only once n exceeds a certain value; below that value, 10n² is smaller. (A sketch for finding the crossover follows.)
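A minimal sketch locating the crossover point for these two running times (using log base 2, an assumption since the slide does not specify the base):

# Sketch: find where 1000*n*log2(n) becomes cheaper than 10*n^2.
import math

def t_a(n): return 1000 * n * math.log2(n)
def t_b(n): return 10 * n * n

n = 2
while t_a(n) >= t_b(n):
    n += 1
print("crossover near n =", n)   # ~996: below this, 10n^2 is actually smaller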