Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
5 – HAL
Exploiting
Parallelism
Quadrature (integration)
• Problem Def: find integral of f(x) from x=a to x=b
[Figure: plot of f(x) for x from 0 to 3.5, y from 0 to 12]
(“quad.c” from the first lab)
• The approximate integral is the sum of the areas under the line
• Each area is approximated by a rectangle (all of width “h”)
• The calculation of each area is independent of the others, so they can be done in any order
• ==> Calculate these in parallel…
[Figure: the same curve divided into strips; a strip from x to y = x + h has area h * 0.5*[f(x) + f(y)] (trapezium rule)]
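A minimal serial sketch of this quadrature, in the spirit of the lab's quad.c (the integrand f, the limits and the number of strips here are illustrative assumptions, not the lab's actual code):

#include <stdio.h>

/* example integrand; an assumption, not the lab's actual f(x) */
double f(double x) { return x * x + 1.0; }

int main(void) {
    const double a = 0.0, b = 3.0;   /* limits of integration (illustrative) */
    const int n = 1000;              /* number of strips */
    const double h = (b - a) / n;    /* every strip has the same width h */
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double x = a + i * h;
        /* trapezium rule: width h times the average of the two heights */
        sum += h * 0.5 * (f(x) + f(x + h));
    }
    printf("integral of f from %g to %g ~= %f\n", a, b, sum);
    return 0;
}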
Specific example…
Each trapezoidal area is independent, therefore we calculate them in parallel.
The TOTAL integral is the SUM of the areas of the 6 trapezoids.
This is one way we can distribute the work between (e.g.) 3 cores, so it should go 3x faster than on 1 core.
But there are some overheads: bringing all the areas together and summing them to get the total integral.
One possible distribution of the 6 areas:
core 0: areas 0, 1
core 1: areas 2, 3
core 2: areas 4, 5
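A minimal MPI sketch of this distribute-then-sum pattern (the integrand f, the limits A and B, and N = 6 strips are illustrative assumptions; it also assumes N divides evenly by the number of processes):

#include <mpi.h>
#include <stdio.h>

#define A 0.0
#define B 3.0
#define N 6                       /* number of trapezoids, as on the slide */

double f(double x) { return x * x + 1.0; }   /* example integrand */

int main(int argc, char **argv) {
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double h = (B - A) / N;
    int per_proc = N / nprocs;    /* assumes N divides evenly by nprocs */
    double local = 0.0;
    /* contiguous block of trapezoids per process, as in the diagram */
    for (int i = rank * per_proc; i < (rank + 1) * per_proc; i++) {
        double x = A + i * h;
        local += h * 0.5 * (f(x) + f(x + h));
    }

    /* the overhead mentioned above: summing all partial areas on rank 0 */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("total integral ~= %f\n", total);

    MPI_Finalize();
    return 0;
}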
There are many ways to distribute work between resources.
You want to:
• Share work as evenly as possible
• Minimise overheads
• Do it logically so you understand it
• Comment your code!
Scalability and Speedup
• Speedup is the ratio of the time it takes to run a program without parallelism to the time it takes to run in parallel.
• Scalability is a measure of how much speedup the program gains as more processor cores are added.
• The program stops scaling beyond the point at which adding more processors gives no additional speedup.
Speed Up & Efficiency
• Time on 1 core: t1
• Could be either the sequential implementation or the p=1 version of the parallel code
• Time on p cores: t(p)
• Speed-up: S(p) = t1 / t(p)
• Ideal speed-up is when S(p) = p
• Super-linear speed-up: when S(p) > p
• Efficiency: E(p) = S(p)/p, usually expressed as a percentage
• What is “good efficiency”?
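A tiny C sketch of these two definitions (the 180 s and 100 s timings are taken from the worked example later in this lecture, purely for illustration):

#include <stdio.h>

double speedup(double t1, double tp)            { return t1 / tp; }
double efficiency(double t1, double tp, int p)  { return speedup(t1, tp) / p; }

int main(void) {
    double t1 = 180.0, t2 = 100.0;   /* timings from the worked example below */
    printf("S(2) = %.2f\n", speedup(t1, t2));                   /* 1.80 */
    printf("E(2) = %.0f%%\n", 100.0 * efficiency(t1, t2, 2));   /* 90%  */
    return 0;
}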
Scaling
• Strong scaling
– Keep the problem size fixed
– The time to solution decreases proportionally as the number of cores increases
• Weak scaling
– As the problem size increases, increase the number of cores so that the time to solution remains constant
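A sketch of how the two modes might set the problem size in a scaling study (N_FIXED, N_PER_CORE and problem_size are illustrative assumptions, not course code):

#include <stdio.h>

#define N_FIXED    100000000L   /* strong scaling: total work is fixed  */
#define N_PER_CORE   1000000L   /* weak scaling: work per core is fixed */

/* problem size n as a function of core count p */
long problem_size(int p, int weak_scaling) {
    return weak_scaling ? N_PER_CORE * (long)p   /* grows with p     */
                        : N_FIXED;               /* independent of p */
}

int main(void) {
    for (int p = 1; p <= 8; p *= 2)
        printf("p=%d  strong: n=%ld  weak: n=%ld\n",
               p, problem_size(p, 0), problem_size(p, 1));
    return 0;
}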
How much parallelism is there?
• Amdahl’s law (1967)
– A program will never go faster than the sum of the parts that do not run in parallel (the serial portions), no matter how many processing elements we have
Amdahl’s Law
• α (alpha): the serial proportion of the original code
• T(p) = α·T(1) + (1-α)·T(1)/p
• S(p) = T(1)/T(p)
• Thus S(p) = 1 / (α + (1-α)/p)
• The maximum speed-up depends only on the proportion of the code that is serial
• Maximum speed-up (p → ∞) is 1/α
• If 1% of a code is serial, what is the max speed-up?
• α = 0.01, so max S(p) = 1/α = 100
• So why do HPC codes use 1000s of cores?
– How many cores are needed for that 100x speed-up?
e.g.
p = 100: S(p) = 1/[0.01 + 0.99/100] = 50.25
p = 1000: S(p) = 1/[0.01 + 0.99/1000] = 91.0
– It is not all about a fixed problem size
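A short C sketch that reproduces these figures (nothing here is from the course code; it just evaluates the formula above):

#include <stdio.h>

/* Amdahl's law: S(p) = 1 / (alpha + (1 - alpha)/p) */
double amdahl_speedup(double alpha, double p) {
    return 1.0 / (alpha + (1.0 - alpha) / p);
}

int main(void) {
    double alpha = 0.01;   /* 1% of the code is serial, as on this slide */
    printf("p=100:   S = %.2f\n", amdahl_speedup(alpha, 100.0));   /* 50.25 */
    printf("p=1000:  S = %.2f\n", amdahl_speedup(alpha, 1000.0));  /* ~91.0 */
    printf("p->inf:  S -> %.0f\n", 1.0 / alpha);                   /* 100   */
    return 0;
}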
Examples using Amdahl’s Law
• Apply Amdahl’s Law to determine the theoretical
maximum speed-up for a code that takes 180 seconds on 1
core and 100 seconds on 2 cores.
• T(1) = 180, T(2) = 100
• T(p) = α·T(1) + (1-α)·T(1)/p
• Let S be the time taken by the serial proportion and
let P be the time taken on 1 core by the parallel proportion
• So…
T(1) = 180 = S + P/1 and T(2) = 100 = S + P/2
T(1) - T(2) = 180 - 100 = 80 = (S + P/1) - (S + P/2) = P/2
• Therefore P = 160 secs and S = 20 secs (substitute back in to check!)
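Doing that substitution check (just the arithmetic from the values above):
T(1) = S + P = 20 + 160 = 180 ✓
T(2) = S + P/2 = 20 + 160/2 = 100 ✓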
• α = proportion of serial to total = 20/180 = 1/9
• Max speed-up = 1/α = 9
• The fastest possible time is S
(i.e. with an infinite number of cores, 160/∞ → 0)
• Speed-up at the fastest: T(1)/T(∞) = 180/20 = 9
• The fastest this code will go, according to Amdahl, is 9 times faster than serial
– NB the answer is not “9 seconds”; it is “9 times faster than serial”. Be careful with units etc. when applying Amdahl’s Law
• The answer is NOT derived like this:
T(2) is 100 seconds but T(1) is 180, so serial is 80 seconds, so α = 80/180 = 4/9; you can test it and see it is wrong:
T(2) = (4/9)·T(1) + (5/9)·T(1)/2 = (4/9)·180 + (5/9)·180/2
= 80 + 50 = 130 (c.f. the correct value of 100 secs from the question)
How much parallelism is there?
• J. Gustafson’s observations (around Amdahl’s law):
– For many problems, as the problem size grows, the work required for the parallel part grows faster than the work required for the serial part
– The serial fraction decreases,
so by Amdahl’s law scalability improves
Keep the run time fixed, but use more cores to do larger examples
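For reference, Gustafson's observation is often expressed as a "scaled speedup". This standard formula is not on the slide; here α is the serial fraction of the time measured on the p-core run:

/* Gustafson's scaled speedup (standard formula; not from these slides):
   alpha = serial fraction of the wall-clock time of the p-core run */
double gustafson_speedup(double alpha, double p) {
    return alpha + (1.0 - alpha) * p;   /* = p - alpha*(p - 1) */
}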
Amdahl vs Gustafson
• Both are right
• Both are theoretical
– Amdahl’s point is pessimistic: for a fixed problem/program there is a limit to its acceleration by parallelization
– Gustafson’s point is optimistic: for growing problems there is a chance to keep up using parallelization
In both cases, minimising the amount of time spent in sequential code is paramount
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane