XJCO3221 Parallel Computation
University of Leeds
Lecture 4: Theory of parallel performance
Outline: Parallel performance; Quantifying parallel performance; Summary and next lecture
Previous lecture
In the last lecture we started to look at solving problems in parallel:
Vector addition, which can be parallelised for shared memory systems by using the fork-join construct.
Implemented in OpenMP as a single line just before the loop (see the sketch below):
#pragma omp parallel for
Mandelbrot set, which has a nested loop.
Both are data-parallel problems (‘maps’), since the calculations in the loop are independent.
Still difficult to achieve good performance for the Mandelbrot set.
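For reference, a minimal sketch of the parallel vector addition; the function and variable names are illustrative, not the exact code from last lecture.

```c
/* Element-wise vector addition c = a + b (illustrative sketch).
   Each iteration is independent (a 'map'), so the loop can be divided
   between threads by the fork-join construct.
   Compile with OpenMP enabled, e.g.: gcc -fopenmp vecadd.c */
void vector_add(const double *a, const double *b, double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
```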
This lecture
Now we will look at some general considerations for parallel performance:
Introduce common parallel overheads.
Common metrics for parallel performance.
Classic models for predicting parallel speed-up and highlighting potential pitfalls.
How these relate to scaling, i.e. how performance varies with the number of processors and the problem size.
Notation
For this lecture we will use the following notation:
Symbol | Meaning                  | Notes
n      | Problem size             | e.g. vector size, list length, image size, ...
p      | No. of processing units  | e.g. cores, threads, processes, ...
ts     | Serial execution time    | ‘Optimal’
tp     | Parallel execution time  | Not necessarily optimal
f      | Serial fraction          | Amdahl, Gustafson-Barsis
What we are trying to achieve
Assume a problem can be solved by a serial algorithm in a time ts. We assume this is optimal, i.e. cannot be improved (in serial).
In practice the optimal ts is rarely employed:
The optimal solution may not be known.
It may be known, but take too long to implement.
Usually consider the serial algorithm which is ‘equivalent’ to the parallel one.
For instance, if developing a parallel bubblesort, would probably compare to serial bubblesort (rather than quicksort, mergesort, heapsort etc.).
Parallel acceleration
One way to improve on the serial execution time ts is to implement a parallel solution on parallel hardware.
May be possible to ‘beat’ ts by exploiting simultaneous calculations.
Can also make better use of shared memory cache.
Denote the (not necessarily optimal) parallel execution time tp.
Measured in same units as ts.
On ‘as similar as possible’ hardware.
Sometimes known as the wall clock time, as it is what ‘a clock on the wall’ would measure.
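As a hedged illustration of measuring tp, OpenMP's omp_get_wtime() returns elapsed wall clock time in seconds; the work() function below is just a stand-in for the computation being timed.

```c
#include <stdio.h>
#include <omp.h>

/* Stand-in for the computation being timed (illustrative only). */
static void work(void)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < 100000000L; i++)
        sum += 1.0 / (i + 1.0);
    /* Print the result so the compiler cannot optimise the loop away. */
    printf("checksum = %f\n", sum);
}

int main(void)
{
    double start = omp_get_wtime();      /* wall clock time, in seconds */
    work();
    double tp = omp_get_wtime() - start; /* elapsed wall clock time     */
    printf("parallel execution time tp = %g s\n", tp);
    return 0;
}
```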
Simultaneous calculations (ideally)
[Figure: the same calculations performed on a single core versus in parallel on p = 4 cores (core 0 to core 3).]
Multi-core memory cache
Recall from Lecture 2 that using multiple cores can make better use of memory cache.
[Figure: Core 1 and Core 2 sharing main memory (cache diagram from Lecture 2).]
Fewer cache misses:
Cache lines pulled up by one core may include data required by another core.
Depending on how data is arranged in memory and accessed, a parallel code may result in fewer cache misses overall than the equivalent serial algorithm.
Challenges to parallel performance
These potential benefits must be offset by the many challenges to achieving good parallel performance.
In Lecture 2 we saw one example, false sharing:
Hardware performance loss in maintaining cache coherency when two cores repeatedly write to the same cache line, even though they never read the other core’s data.
Over the coming lectures we will see two important, general challenges: synchronisation and load balancing.
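As a reminder of what false sharing looks like in code, here is a hedged sketch (the function and array names are illustrative). Each thread only ever increments its own counter, yet the counters sit next to each other in memory.

```c
#include <omp.h>

#define MAX_THREADS 64

/* Adjacent elements of counts[] are very likely to sit in the same
   cache line, so different threads repeatedly invalidate each other's
   copy of that line even though no element is logically shared:
   false sharing (Lecture 2). Padding each counter out to a full cache
   line is one common fix. */
static long counts[MAX_THREADS];

void count_even_values(const int *data, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        if (data[i] % 2 == 0)
            counts[omp_get_thread_num()]++;
}
```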
Synchronisation
In the fork-join construct from last lecture, multiple threads complete before the main thread continues.
[Figure: fork-join: a single thread forks into multiple threads, which later join back into a single thread.]
This joining requires resources:
The main thread may repeatedly probe the worker threads’ status.
Alternatively, workers may signal their completion to the main thread.
This is an example of synchronisation.
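A minimal OpenMP sketch of this join: the end of a parallel region carries an implicit barrier, so the single (main) thread only continues once every worker has finished. The printed messages are purely illustrative.

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        printf("Worker thread %d finished its work.\n", omp_get_thread_num());
    }   /* Implicit barrier: every thread must reach this point (the
           'join') before execution continues on the main thread alone. */

    printf("Main thread continues only after all workers have joined.\n");
    return 0;
}
```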
Load balancing
A related issue is load balancing:
What if the worker threads did not all finish at the same time?
Some would be idle, waiting for others to finish.
Poor use of available resources.
This happens in the Mandelbrot set, since each thread performs a different number of calculations [cf. last lecture; Lecture 13] (see the sketch below).
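A hedged sketch of how an OpenMP schedule clause can mitigate this; iterations_for_pixel() is a hypothetical stand-in for the per-pixel Mandelbrot work, and the details of scheduling are covered properly in Lecture 13.

```c
/* Hypothetical stand-in for the per-pixel Mandelbrot iteration count;
   the point is only that the work varies strongly from pixel to pixel. */
static int iterations_for_pixel(int i)
{
    volatile double z = 0.0;
    int iters = (i % 97) * 1000;        /* deliberately uneven workload */
    for (int k = 0; k < iters; k++)
        z = 0.5 * z + 1.0;
    return iters;
}

/* With a static schedule (the typical default), threads given 'expensive'
   pixels finish late while the others sit idle: poor load balance.
   schedule(dynamic) hands iterations out in chunks at runtime, improving
   the balance at the cost of some scheduling overhead (Lecture 13). */
void render_row(int *out, int width)
{
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < width; i++)
        out[i] = iterations_for_pixel(i);
}
```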
Parallel overheads
Even if these challenges could be overcome, there are inevitable overheads. For example:
Time and resources to create, schedule and destroy threads and/or processes during runtime (e.g. fork-join).
Communication between threads/processes not present in the serial equivalent.
Computation not present in serial, e.g. when partitioning the problem size between threads.
The impact may be small or large depending on parallel algorithm and hardware architecture.
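One hedged way to get a feel for the first of these overheads is to time an ‘empty’ OpenMP parallel region. The repetition count is arbitrary, and a real runtime may reuse a thread pool, so this measures the cost of entering and leaving the region (thread creation or wake-up, scheduling, and the join) rather than pure thread creation.

```c
#include <stdio.h>
#include <omp.h>

int main(void)
{
    const int reps = 1000;              /* arbitrary number of repetitions */

    double start = omp_get_wtime();
    for (int r = 0; r < reps; r++) {
        /* An 'empty' parallel region: any time spent here is pure
           fork-join overhead, since no useful work is performed. */
        #pragma omp parallel
        {
        }
    }
    double elapsed = omp_get_wtime() - start;

    printf("Average fork-join overhead per region: %g s\n", elapsed / reps);
    return 0;
}
```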
Metrics for parallel performance
There are various measurements of parallel vs. serial performance. The most common is the speedup S:
\[ S = \frac{t_s}{t_p} \]
If the parallel version were p times faster than the serial one, we would have
\[ t_p = \frac{t_s}{p} \;\Longrightarrow\; S = \frac{t_s}{t_p} = p \]
This is rarely realised in practice due to parallel overheads.
Speedup example
[Figure: example speedup curve; the shortfall from ideal speedup is labelled ‘parallel overhead’.]
Superlinear speedup
S(p) > p is possible:
Usually due to memory cache (or a suboptimal ts).
This is super-linear speedup.
Example: a benchmark computational fluid dynamics algorithm (figure from the reference below).
However, this is rare – most commonly we see S(p) < p.
From Parallel Programming in OpenMP, Chandra et al. (Academic, 2001).
Efficiency
Another common parallel performance metric is the efficiency E:
\[ E = \frac{t_s}{p\,t_p} = \frac{S}{p} \]
‘Ideal’ speedup S = p corresponds to E = 1.
Often expressed as a percentage: E = 1 corresponds to 100%.
Typically E < 1 due to parallel overheads.
Superlinear speedup gives E > 1.
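A minimal sketch that turns measured times into both metrics; the numbers below are invented purely for illustration.

```c
#include <stdio.h>

int main(void)
{
    /* Illustrative measured times; in practice ts and tp would come from
       timing the serial and parallel codes on similar hardware. */
    double ts = 10.0;   /* serial wall clock time (s)   */
    double tp = 3.1;    /* parallel wall clock time (s) */
    int    p  = 4;      /* number of processing units   */

    double S = ts / tp; /* speedup                            */
    double E = S / p;   /* efficiency, equivalently ts/(p*tp) */

    printf("speedup    S = %.2f (ideal %d)\n", S, p);
    printf("efficiency E = %.2f = %.0f%%\n", E, 100.0 * E);
    return 0;
}
```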
Models for parallel performance
It is desirable to theoretically predict tp for parallel algorithms:
Select the ‘best’ without development and testing.
Identify ‘bottlenecks’ for further investigation.
However, it is challenging to derive precise equations for tp:
They would need to include e.g. the memory cache, the thread scheduler, etc.
They would involve many unknown parameters requiring calibration.
They would need re-calibration for new hardware.
Nevertheless, even simple models can predict trends, in particular parallel scaling: how performance varies with p.
Amdahl’s law
Suppose a fraction f of ts cannot be parallelised. Then
\[ t_s = f\,t_s + (1-f)\,t_s \]
\[ \Longrightarrow\; t_p \ge f\,t_s + \frac{(1-f)\,t_s}{p} \]
\[ \Longrightarrow\; S = \frac{t_s}{t_p} \le \frac{t_s}{f\,t_s + (1-f)\,t_s/p} = \frac{1}{f + (1-f)/p} \]
This is Amdahl’s law¹ (1967).
As p grows it predicts S ≤ 1/f, i.e. the speedup is bounded no matter how many processing units are used.
e.g. f = 0.2 gives a maximum speedup of 5, even for p = ∞!
¹ Amdahl, AFIPS Conference Proceedings 30, 483 (1967).
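A small sketch of evaluating the bound, which reproduces the f = 0.2 example above; the function name is ours, not from the lecture.

```c
#include <stdio.h>

/* Upper bound on speedup from Amdahl's law: S <= 1 / (f + (1-f)/p). */
static double amdahl_speedup(double f, double p)
{
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void)
{
    /* f = 0.2: the bound approaches 1/f = 5 however large p becomes. */
    printf("p = 4:    S <= %.2f\n", amdahl_speedup(0.2, 4.0));
    printf("p = 1024: S <= %.2f\n", amdahl_speedup(0.2, 1024.0));
    return 0;
}
```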
Schematic for Amdahl’s law (p = 3)
[Figure: the parallelisable part of the code is divided between p = 3 processors, while the serial fraction runs on a single core.]
Gustafson-Barsis law
However, Amdahl assumed ts (and hence n) was fixed. Suppose instead that n increases with p such that tp is fixed. Then
\[ t_p = f\,t_p + (1-f)\,t_p \]
\[ \Longrightarrow\; t_s \le f\,t_p + p\,(1-f)\,t_p \]
\[ \Longrightarrow\; S = \frac{t_s}{t_p} \le f + p\,(1-f) = p + f\,(1-p) \]
Now S grows as roughly (1 − f)p for large p, so there is no upper bound as p → ∞.
This is the Gustafson-Barsis law, or just Gustafson’s law¹.
¹ Gustafson, Comm. ACM 31, 532 (1988).
Schematic for Gustafson-Barsis law (p=3)
[Figure: the problem size is increased with p (here p = 3) so that tp stays fixed.]
Amdahl versus Gustafson-Barsis
[Figure: predicted speedup S versus p (p = 1 to 16) under Amdahl’s law and the Gustafson-Barsis law, for f = 0, 0.01, 0.1 and 0.2.]
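As a hedged sketch, the comparison can be tabulated directly from the two formulas for the f and p values shown in the figure (the output formatting is ours).

```c
#include <stdio.h>

int main(void)
{
    const double f_vals[] = { 0.0, 0.01, 0.1, 0.2 };
    const int    p_vals[] = { 1, 2, 4, 8, 16 };

    printf("%6s %4s %10s %12s\n", "f", "p", "Amdahl", "Gustafson");
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 5; j++) {
            double f = f_vals[i];
            double p = p_vals[j];
            double amdahl    = 1.0 / (f + (1.0 - f) / p); /* S <= 1/(f+(1-f)/p) */
            double gustafson = p + f * (1.0 - p);         /* S <= p + f(1-p)    */
            printf("%6.2f %4.0f %10.2f %12.2f\n", f, p, amdahl, gustafson);
        }
    return 0;
}
```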
Weak versus strong scaling
The differences are encapsulated in weak versus strong scaling:
Strong scaling: increasing p with n fixed.
Amdahl’s law.
Appropriate when we cannot control the problem size, e.g. data analysis/mining.
Weak scaling: increasing n with p.
Gustafson-Barsis law.
Appropriate when we have freedom to vary n, e.g. higher resolution meshes for scientific/engineering applications; more/larger layers in neural networks.
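As an illustration of how the two regimes differ in practice, a hedged sketch of choosing the problem size for each run of a scaling study; the variable names are ours.

```c
#include <stdio.h>
#include <omp.h>

/* Illustrative driver for a scaling study.
   Strong scaling: keep the problem size n fixed as p grows.
   Weak scaling:   grow n in proportion to p, so the work per
                   processing unit stays roughly constant. */
int main(void)
{
    const long n_base     = 1000000;   /* arbitrary base problem size */
    const int  weak_scale = 1;         /* 0 = strong, 1 = weak        */

    int  p = omp_get_max_threads();
    long n = weak_scale ? n_base * p : n_base;

    printf("p = %d processing units, problem size n = %ld\n", p, n);
    /* ... solve the problem of size n here and record the wall clock time ... */
    return 0;
}
```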
Summary and next lecture
Today we have looked at parallel performance:
Two common metrics: speedup and efficiency.
Challenging to achieve ideal speedup due to various parallel overheads.
Classic models known as Amdahl’s law and the Gustafson-Barsis law.
Correspond to strong and weak scaling, respectively.
Next time we will look more closely at data dependencies in parallel loops.