High Performance Computing
Course Notes

Performance I

Dr Ligang He


Metrics to measure the parallelization quality of parallel programs

Degree of Parallelism, average parallelism

Effective work

Speedup

Parallel efficiency


Degree of Parallelism

 Degree of Parallelism (DOP)

 The number of processors engaged in execution at the same time

 Two forms of the DOP function: continuous form and discrete form


Degrees of Parallelism

Factors that affect the DOP include:
 Application properties
 Data dependency
 Communication overhead

 Resource limitations
 Number of processors
 Memory, I/O

 Algorithms
 How does the algorithm divide up the work?


Effective Work

Effective Work

 This is the total amount of computation executed within a given time interval.

 Effective work relates to the DOP


Effective work

Calculating effective work
 m homogeneous processors
 Processing capacity of a single processor (execution rate) = Δ
 DOP(t) = number of busy PEs at time t, for t in [t1, t2]

 Total effective work in discrete form: W = Δ × Σ i × t_i, where t_i is the total time during which DOP = i and the sum runs over i = 1, ..., m

 Total effective work in continuous form: W = Δ × ∫ DOP(t) dt, integrating over [t1, t2]


Average Parallelism

Average parallelism:

 Continuous form: A = (1 / (t2 - t1)) × ∫ DOP(t) dt, integrating over [t1, t2]

 Discrete form: A = (Σ i × t_i) / (Σ t_i), where t_i is the total time during which DOP = i and the sums run over i = 1, ..., m
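
As an illustration of the discrete forms above, here is a minimal Python sketch that computes the total effective work and the average parallelism from a DOP profile; the profile values and the execution rate Δ are invented example numbers, not figures from the course.

```python
# Minimal sketch: effective work and average parallelism (discrete form).
# Assumption: a made-up DOP profile, where t[i] = total time (in seconds)
# during which exactly i processors were busy, and an arbitrary execution
# rate delta (operations per second per processor).

delta = 2.0e9                            # example execution rate of one PE
t = {1: 4.0, 2: 3.0, 4: 2.0, 8: 1.0}     # example profile: DOP i -> time t_i

# Total effective work: W = delta * sum(i * t_i)
work = delta * sum(i * t_i for i, t_i in t.items())

# Average parallelism: A = sum(i * t_i) / sum(t_i)
avg_parallelism = sum(i * t_i for i, t_i in t.items()) / sum(t.values())

print(f"Total effective work: {work:.3e} operations")   # 5.200e+10
print(f"Average parallelism:  {avg_parallelism:.2f}")   # 2.60
```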


Speedup

We want to know the improvement (or not) brought about by parallelising an application code.

The improvement can be measured by speedup.
 In the simplest form, speedup is the ratio of the execution time of a serial implementation to the execution time of a parallel implementation.

If n processors are used, then:

S(n) = t1 / tn

where t1 is the worst-case execution time of the optimal serial implementation, and tn is the worst-case execution time of the parallel algorithm using n processors.
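
As a small worked example of this definition, the sketch below computes S(n) from two hypothetical timings; t1, tn and n are invented numbers, not measurements.

```python
# Minimal sketch: speedup from measured execution times, S(n) = t1 / tn.
# The timings below are hypothetical, for illustration only.

t1 = 120.0   # worst-case time of the optimal serial implementation (s)
tn = 14.2    # worst-case time of the parallel implementation on n PEs (s)
n = 10

speedup = t1 / tn
print(f"S({n}) = {speedup:.2f}")   # ~8.45, below the linear bound of n = 10
```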


Speedup

What is “good” speedup?

 Linear speedup is regarded as optimal

 Maximum speedup for a parallel algorithm with n processors is n.

 To illustrate this:

• Consider an application whose serial execution time is t1

• The application is split into n processes

• Assume no overheads, communications, synchronisation etc.

• The least execution time is t1 / n

• So the maximum speedup is S(n) = t1 / (t1 / n) = n

 Not always true (we may achieve superlinearity in some special circumstances)


Speedup

Some tasks can exceed linear speedup.

 This is superlinear speedup (S(n) > n)

Reasons

 Cache or memory effects

 Evidence of sub-optimal sequential implementation


Speedup

 The general trend of speedup as the number of processors increases:

• At first, speedup increases as the number of PEs increases, but the gap to the maximum (linear) speedup also widens

• After the speedup reaches its maximum, adding further processors brings no benefit and will harm performance (illustrated in the sketch below)
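
The sketch below illustrates this trend with a toy cost model in which the parallel overhead grows with the number of processors; the model T(p) = T1/p + c*p^2 and its constants are assumptions made for illustration (the same shape as Algorithm 3 later in these notes), not measurements.

```python
# Minimal sketch: speedup rises, peaks, then falls as processors are added.
# Assumed toy model: T(p) = T1/p + c*p**2, where the second term stands for
# overheads (communication, synchronisation) that grow with p.

T1 = 10_100.0   # hypothetical serial execution time
c = 0.6         # hypothetical overhead coefficient

def speedup(p: int) -> float:
    return T1 / (T1 / p + c * p ** 2)

for p in (1, 2, 4, 8, 16, 32, 64, 128):
    print(f"p = {p:3d}   S(p) = {speedup(p):6.2f}")
# S(p) grows at first, peaks around p ~ 20 for these constants, and then
# falls: beyond the peak, adding processors harms performance.
```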


Speedup

How to obtain good S(n) for large n?

 Algorithm design: a minimal sequential component and a good percentage of inherent (data) parallelism.

 Workload management: balancing the workload among processors.

 When there are different ways to partition the data and achieve load balance, try to maintain a high ratio of computation to communication (computation represents effective work while communication represents overhead); see the sketch below.

• Choose the partitioning that leads to less communication

• Keep the frequency of communication between processors low

• Increase the size of the work unit run by each processor
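
As an illustration of choosing the partitioning with the higher computation-to-communication ratio, the sketch below compares a 1-D strip decomposition with a 2-D block decomposition of an N x N grid; the grid size, the process count and the assumption that communication is proportional to the halo (boundary) cells exchanged per step are all illustrative.

```python
# Minimal sketch: computation-to-communication ratio for two ways of
# partitioning an N x N grid over P processes (assumed halo-exchange model).
import math

N = 1024   # hypothetical grid dimension
P = 16     # hypothetical number of processes (a perfect square for blocks)

compute_per_proc = N * N / P              # cells updated by each process

# 1-D strips: an interior process exchanges two full rows of N cells.
strip_comm = 2 * N

# 2-D blocks: an interior process exchanges four edges of N / sqrt(P) cells.
block_comm = 4 * N / math.sqrt(P)

print(f"strip ratio = {compute_per_proc / strip_comm:7.1f}")   # 32.0
print(f"block ratio = {compute_per_proc / block_comm:7.1f}")   # 64.0
# The block partitioning communicates less per process here, so it keeps a
# higher ratio of computation (effective work) to communication (overhead).
```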


Reducing the Impact of Communication

Communication has a crucial impact on the performance of parallel programs.

How to reduce the impact of communication:

 Minimize the amount of communication (e.g. by maintaining good data locality).

 Overlap communication with computation where possible.

 Reduce latency and overhead by sending a few large messages rather than many small messages (see the sketch below).

 At the hardware level, latency can be reduced by using fast (but expensive) communication hardware.
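
The benefit of message aggregation can be illustrated with a simple latency-plus-bandwidth cost model, T = messages * latency + bytes / bandwidth; the latency and bandwidth figures below are invented for illustration, not properties of any particular network.

```python
# Minimal sketch: cost of sending the same data as many small messages
# versus a few large ones, under an assumed latency + bandwidth model.

latency = 5e-6          # hypothetical per-message overhead: 5 microseconds
bandwidth = 1e9         # hypothetical link bandwidth: 1 GB/s
total_bytes = 8_000_000 # same total payload in both cases

def transfer_time(n_messages: int) -> float:
    return n_messages * latency + total_bytes / bandwidth

print(f"10000 messages of 800 B : {transfer_time(10_000) * 1e3:6.2f} ms")  # 58.00
print(f"   10 messages of 800 kB: {transfer_time(10) * 1e3:6.2f} ms")      #  8.05
# The pure data-transfer term is identical; the many-small-messages version
# pays the per-message latency 10,000 times, so fewer, larger messages win.
```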


Parallel efficiency

Parallel efficiency:

E(n) = S(n) / n

Parallel programs are not usually 100% efficient, i.e. S(n) << n.

Main issues that affect parallel efficiency:

 The same factors that affect speedup

 Plus the impact of n; typically, the greater n is, the lower the efficiency

Iso-efficiency

Constant efficiency:

 How the amount of computation performed (N) must scale with the number of processors P to keep the parallel efficiency E constant

 The function of N over P is called an algorithm's iso-efficiency function

 An algorithm with an iso-efficiency function of O(P) is highly scalable

• E.g. if p is increased three times, N only needs to increase three times to maintain efficiency

 An algorithm with a quadratic or exponential iso-efficiency function is less scalable

• E.g. if p is increased three times, N needs to increase 9 times and 8 times, respectively

Work out the iso-efficiency function

 Given the parallel efficiency function as follows, work out the iso-efficiency function.

Answer: N = O(P)

Four approaches to modelling the performance of a parallel application

Speedup

Amdahl's law

Asymptotic analysis

Modelling execution time

Speedup approach

Using the speedup approach, we can say something like "this algorithm achieved a speedup of S on p processors with problem size N".

This approach can give us some idea of the algorithm's quality, but we cannot judge the quality of an algorithm from a single speedup figure.

This point is elaborated in the following example.

Speedup approach

Consider a sequential algorithm whose optimal execution time is T = N + N^2, where N is the problem size.

Parallel Algorithm 1: T = N + (N^2 / p)
 Partitions only the computationally expensive O(N^2) component
 No other costs

Parallel Algorithm 2: T = ((N + N^2) / p) + 100
• Partitions the whole computation
• Introduces a fixed overhead cost of 100

Parallel Algorithm 3: T = ((N + N^2) / p) + 0.6p^2
• Partitions the whole computation
• Introduces a variable overhead cost of 0.6p^2

Speedup approach

These algorithms all achieve a speedup of about 10.8 when p = 12 and N = 100, but they diverge from each other as p becomes large.

Algorithm 1: T = N + (N^2 / p)

Algorithm 2: T = ((N + N^2) / p) + 100

Algorithm 3: T = ((N + N^2) / p) + 0.6p^2

Amdahl's Law

Applications may contain elements that are not amenable to parallelisation. Let this serial fraction be f:

 If we make the remaining part n times faster by running it on n processors, then the time Tn is:

Tn = f × t1 + (1 - f) × t1 / n

 Hence, the speedup is:

S(n) = t1 / Tn = 1 / (f + (1 - f) / n)

For example, if an application performs a final (non-parallelisable) collective operation at the end of each iteration which accounts for 8% of the computation time, the maximum achievable speedup is 1 / 0.08 = 12.5. This is Amdahl's Law.
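
A short sketch of the formula above, using the 8% serial fraction from the example; the processor counts are arbitrary.

```python
# Minimal sketch: Amdahl's Law, S(n) = 1 / (f + (1 - f) / n),
# evaluated for the serial fraction f = 0.08 used in the example above.

def amdahl_speedup(f: float, n: int) -> float:
    return 1.0 / (f + (1.0 - f) / n)

f = 0.08
for n in (2, 8, 32, 128, 1024):
    print(f"n = {n:4d}   S(n) = {amdahl_speedup(f, n):5.2f}")
print(f"limit as n -> infinity: {1.0 / f:.1f}")   # 12.5, the maximum speedup
```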

Application of Amdahl's Law

 Part A takes 75% and part B takes 25% of the whole computation time

 If we decide to parallelise part B, then the upper bound of the speedup is 1 / 0.75 = 1.33

 If we decide to parallelise part A, then the upper bound of the speedup is 1 / 0.25 = 4

Application of Amdahl's Law

 Part A takes 75% and part B takes 25% of the whole computation time

 If we make part B 5 times faster, then speedup = 1 / (0.75 + 0.25 / 5) = 1.25

 If we make part A 2 times faster, then speedup = 1 / (0.25 + 0.75 / 2) = 1.6

 Therefore, making A twice as fast is better (and typically much easier) than making B five times faster

Amdahl's Law

Amdahl's law shows us the limitations of parallelising codes.

Disadvantages:

• It can only tell us the upper bound of the speedup for a particular algorithm

• It cannot tell us whether other algorithms with greater parallelism exist for the problem

Asymptotic analysis

 In this modelling approach, we can say something like "the algorithm takes O(n log n) time on n processors"

 Disadvantages:

 It ignores lower-order terms:

• e.g. for an algorithm with time complexity O(n log n), the actual running time could be 10n + n log n; when n is small, the 10n term dominates

 It only gives the order of the execution time of a program, not its actual execution time:

• e.g. given two algorithms, one with time complexity 1000 n log n and the other with 10 n^2: 1000 n log n is better than 10 n^2 once n exceeds a certain value, but 10 n^2 is smaller than 1000 n log n when n is below that value
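
To make the second disadvantage concrete, the sketch below finds the crossover point between the two example complexities numerically; it assumes a base-2 logarithm, which the notes do not specify.

```python
# Minimal sketch: where does 1000*n*log(n) become cheaper than 10*n**2?
# Assumption: base-2 logarithm (the notes do not say which base is meant).
import math

def t_nlogn(n: int) -> float:
    return 1000 * n * math.log2(n)

def t_nsq(n: int) -> float:
    return 10 * n ** 2

crossover = next(n for n in range(2, 10_000) if t_nlogn(n) < t_nsq(n))
print(f"10*n^2 is smaller for n < {crossover}; 1000*n*log2(n) wins beyond it")
# So the O(n log n) algorithm is only the better choice once n is large
# enough; for smaller n the O(n^2) algorithm actually runs faster.
```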