NWEN303 Concurrent Programming
17: Amdahl vs. Gustafson, final showdown!
Marco Servetto VUW
Define Speedup
Speedup = sequentialTime / parallelTime
Empirical measurement:
suppose we have a function fSeq and an equivalent parallel version fPar:

long time0 = System.nanoTime(); fSeq(input);
long time1 = System.nanoTime(); fPar(input);
long time2 = System.nanoTime();
double speedUp = (double)(time1 - time0) / (time2 - time1);

Now I can have an empirical result, but only for a certain input.
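Below is a minimal runnable sketch of this measurement; the class name SpeedupBench, the input size, and the squaring workload are assumptions made for illustration, not part of the course code.

import java.util.stream.IntStream;

public class SpeedupBench {
  // Illustrative stand-ins for fSeq and fPar: the same computation,
  // done sequentially and in parallel (the workload is made up).
  static long fSeq(int[] input) {
    long sum = 0;
    for (int i : input) { sum += (long) i * i; }
    return sum;
  }
  static long fPar(int[] input) {
    return IntStream.of(input).parallel().mapToLong(i -> (long) i * i).sum();
  }

  public static void main(String[] args) {
    int[] input = IntStream.range(0, 10_000_000).toArray();
    long time0 = System.nanoTime(); long r1 = fSeq(input);
    long time1 = System.nanoTime(); long r2 = fPar(input);
    long time2 = System.nanoTime();
    double speedUp = (double) (time1 - time0) / (time2 - time1);
    System.out.println("same result: " + (r1 == r2));
    System.out.println("empirical speedup = " + speedUp + "X (for this input only)");
  }
}

A single run like this is noisy (JIT warm-up, GC); repeating the measurement and averaging gives more reliable numbers.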
Linear Speedup
Maximal theoretical speedup: Speedup = number of used cores
That is:
– if we have 64 cores and we split in 10 tasks, we cannot achieve more than 10X
– if we have 64 cores and we split in 100 tasks, we cannot achieve more than 64X
This holds for embarrassingly parallel problems (a small sketch follows).
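A concrete sketch of an embarrassingly parallel problem; the element values and the squaring operation are made up for illustration:

import java.util.List;

public class EmbarrassinglyParallel {
  public static void main(String[] args) {
    List<Integer> inputs = List.of(1, 2, 3, 4, 5, 6, 7, 8);
    // Every element is processed independently, with no shared state:
    // parallelStream() splits the work into many subtasks, so with
    // enough tasks the speedup can approach the number of cores.
    List<Integer> results = inputs.parallelStream()
        .map(x -> x * x)
        .toList();
    System.out.println(results);
  }
}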
What happens if even a little part of the problem cannot be parallelized at all?
What happens if even a little part of the problem cannot be parallelized at all?
Example: given a list of Persons, each with an age and a date of birth, print the information in the list, but with the age they will have tomorrow.
We can parallelize the computation that discovers whether tomorrow is their birthday, but we still need to print them one by one (sketched below).
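One possible encoding of this example with parallel streams; the Person record, its field names, and the sample data are hypothetical:

import java.time.LocalDate;
import java.time.MonthDay;
import java.util.List;

public class TomorrowAges {
  record Person(String name, int age, LocalDate dateOfBirth) {}  // hypothetical

  public static void main(String[] args) {
    List<Person> people = List.of(
        new Person("Ann", 30, LocalDate.of(1995, 6, 18)),
        new Person("Bob", 41, LocalDate.of(1983, 12, 1)));
    MonthDay tomorrow = MonthDay.from(LocalDate.now().plusDays(1));
    people.parallelStream()
        .map(p -> {  // parallelizable part: is tomorrow their birthday?
          boolean birthday = MonthDay.from(p.dateOfBirth()).equals(tomorrow);
          return p.name() + " will be " + (birthday ? p.age() + 1 : p.age());
        })
        .forEachOrdered(System.out::println);  // sequential part: print one by one
  }
}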
Sub-Linear Speedup
A part of the problem is parallelizable (P), but there is a part that is intrinsically sequential (S).
Example with 6 units of work, S P P P P S (2 sequential + 4 parallelizable):
1 core   -> 6/6 = 1X
2 cores  -> 6/4 = 1.5X
4 cores  -> 6/3 = 2X
1M cores -> 6/2 = 3X
SpeedupA = (S+P) / (S + (P/NCores))   //Amdahl
Of course, we cannot get rid of the sequential part, so the maximum speedup we can obtain is (S+P)/S.
Let's roll some numbers in (with S+P = 100):

                 4 cores   16 cores   64 cores   256 cores   limit (S+P)/S
15% sequential:   2.8X      4.9X       6.1X       6.5X       100/15 = 6.7X
 5% sequential:   3.5X      9X         15.5X      18.5X      100/5  = 20X
 1% sequential:   3.9X      14X        39.3X      72.1X      100/1  = 100X

For example, 15% sequential on 4 cores: 100/(15 + (85/4)) = 2.8X.
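The table above can be recomputed with a few lines of Java; this is only a sketch of Amdahl's formula, with S+P normalized to 100 as in the table:

public class AmdahlTable {
  // SpeedupA = (S+P) / (S + P/NCores)
  static double amdahl(double seqPercent, int nCores) {
    double s = seqPercent, p = 100 - seqPercent;
    return (s + p) / (s + p / nCores);
  }
  public static void main(String[] args) {
    for (double s : new double[]{15, 5, 1}) {
      System.out.printf("%2.0f%% sequential:", s);
      for (int n : new int[]{4, 16, 64, 256}) {
        System.out.printf("  %5.1fX", amdahl(s, n));
      }
      System.out.printf("  (limit %.1fX)%n", 100 / s);
    }
  }
}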
Sub-Linear Speedup
Even small sequential parts dominate!
No matter how many cores you add, even a very small sequential part will prevent scalability.
Assumption: in most problems the fraction S/P is fixed.
Even small sequential parts dominate?
Nope… can you give me some example problems?
Particle interactions (ass2)? S = n, P = n*n
Merge sort (ass1)? S = n (last merge step), P = n*log(n)
//well, not exactly… at first fork-join only divides in 2, then 4, then 8…
//thus it takes a while for massive parallelism to be usable…
//we can account for this by increasing S a little.
In these problems P grows faster than S, so the sequential fraction shrinks as the input grows; see the sketch below.
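A tiny sketch of that claim for merge sort, where the sequential fraction S/(S+P) = 1/(1+log2 n) shrinks as n grows (the chosen input sizes are arbitrary):

public class ShrinkingSequentialPart {
  public static void main(String[] args) {
    for (long n = 1_000; n <= 1_000_000_000L; n *= 1_000) {
      double log2n = Math.log(n) / Math.log(2);
      double seqFraction = n / (n + n * log2n);  // S/(S+P) = 1/(1+log2 n)
      System.out.printf("n=%,d -> sequential fraction = %.1f%%%n",
          n, 100 * seqFraction);
    }
  }
}

So the bigger the list, the smaller the sequential share, and the more cores remain useful.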
Linear Speedup
Given more processing power, we can solve bigger problems in the same amount of time:
1 core   -> S P P P P S                   -> 6/6  = 1X
2 cores  -> S P P P P S + P P P P         -> 10/6 = 1.7X
4 cores  -> S P P P P S + 3 x (P P P P)   -> 18/6 = 3X
1M cores -> (2 + 4*1M)/6 ≈ 667KX
SpeedupG = (S + (P*NCores)) / (S+P)   //Gustafson
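A sketch of the Gustafson formula on the same 6-unit example (S = 2, P = 4); the core counts are the ones from the diagram above:

public class GustafsonExample {
  // SpeedupG = (S + P*NCores) / (S+P)
  static double gustafson(double s, double p, long nCores) {
    return (s + p * nCores) / (s + p);
  }
  public static void main(String[] args) {
    for (long n : new long[]{1, 2, 4, 1_000_000}) {
      System.out.printf("%,d cores -> %.1fX%n", n, gustafson(2, 4, n));
    }
  }
}

With 1M cores this prints about 666,667X, i.e. roughly 667KX.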
What is speedup?
Speedup is how much faster you can do the same thing over and over again.
That is not what I meant by speedup! Why do you care about doing the same thing over and over again? The advantage of having a better machine is that you can do MORE before the deadline!
For example: more precise / fine-grained weather forecasts.
At least they agree if the problem is 100% parallel (S = 0, P = 100):
Amdahl:    (0+100) / (0 + 100/X)  = X
Gustafson: (0 + 100*X) / (0+100)  = X
2 speedup laws?
Does it make sense to try to be very precise? Anyway, we are ignoring all the overhead, cache issues, and more!
If there are two laws, there could be more:
– case-by-case analysis
– sequential asymptotic complexity + parallel asymptotic complexity?
– trees and nested Fork-Join analysis, where every FJ has a sequential and a parallel part, and then we reason by recursion?
When to apply them:
Amdahl:
– The problem is fixed, and you are too slow at handling it!
– S and P grow at about the same speed when the problem grows.
– You are forced to use/update large synchronized data structures.
Gustafson:
– You need efficiency in order to do MORE.
– S grows much more slowly than P when the problem grows.
– You can design tree-based data structures, and often locking the whole thing is not needed.
Term test 2
17 June, start: 9:30 AM, duration: 2 hours, room: MCLT101, closed book.
Take your time and focus on writing CORRECT answers. Please write in an understandable way!
Examples of important concepts:
– how workers can get interrupted, and when
– how checked/unchecked exceptions are handled
– the exact semantics of the synchronized block
– streams!!
– actors
Term test 2
Some conceptual parts, like:
– If we sort a list with 10 elements using 256 cores, what kind of speedup can we hope to get?
Some coding parts, like:
– Parallelize this code while keeping the same behavior (using streams/futures).
– Use actors correctly to encode this scenario.
Some theory parts, like:
– What is a critical section? What are the needed assumptions?
Some advantages/disadvantages parts, like:
– When is it better to use streams instead of actors, and why?