Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
6 – HAL
Practical
Parallelism
• Much covered to date has been idealized theory of applying
parallelism
– Data parallelism vs. task parallelism vs. pipelining
– Amdahl’s Law & strong scaling
• But life is much more complicated…
Practical Parallelism Examples
• Amdahl’s Law
– Presumes time spent in “serial proportion” remains constant
– BUT
• Where does the cost of communications go?
It’s not fixed wrt #p, and it doesn’t always scale linearly with #p
• As use more cores then more options for communication patterns
• Some will scale linearly with #cores, others do better
• e.g. for each core to send data to a ‘root’ core
– (p-1) steps if each does it separately and in turn
– ~log2(p) steps possible via a binomial tree ==> so does better
Practical Parallelism Examples
• Amdahl’s Law
• Domain Decomposition
– Used for data parallelism
– Split the data, do same set of operations
– So equally splitting data to each core should mean no load imbalance
– BUT…
Consider…
• Modelling the temperature change of areas of the Earth due
solely to the sun, over a 24 hour period.
• We can parallelise working out these temperatures:
1. cover the Earth with a grid of “cells” – domain decomposition
2. No interaction between cells (unrealistic but enables easy parallelisation)
3. Temp for a given cell is current temp + warming from sun
4. Warming from the sun is max at noon, zero at night
• What happens if we say that the #flops needed to determine the
temperature rise due to the sun varies between land and sea,
with the land calculation taking 10x more than the sea calculation?
#FLOPS per core per timestep
Core 1: 1+1+1+1+1+10+1+10 = 26 (6 sea, 2 land)
Cores 4 & 5: 8*10 = 80
So if we have to complete every cell’s update for a
given timestep before moving on to the
next timestep for any cell, then the integration
only proceeds at the speed of the slowest:
the cores requiring 26 FLOPS have to wait
for those undertaking 80 FLOPS to finish
COMP328/COMP528 (c) mkbane, University of Liverpool
2nd attempt: row-based domain decomp with the same 5 cores
Core 1 (purple): (1+1+1+10+10) + (5*10) = 73
Blue: 1+10+1+10+10+10+10+1+10+10 = 73
Yellow: 82
Pink: 41, Green: 32
Yellow has most work @ 82 FLOPS
BEST POSSIBLE?
land = 28*10 = 280; sea = 12*1 = 12
Total FLOPs = 292
Average FLOPs per core = 292/5 = 58.4
i.e. it should be possible to do a timestep in the
time it takes to do 59 FLOPS
[Figure: the 5-core domain decomposition grid; “S” marks the 12 sea cells, the remaining 28 cells are land]
SEA: 12/5 =2 with 2 remainder (Purple, Blue)
LAND:
28/5 = 5 with 3 remainder (Yellow, Pink, Green)
Purple & Blue = 3*1 + 5*10 = 53
Yellow, Pink, Green: 2*1 + 6 *10 = 62
This domain decomposition takes
the time of 62 FLOPS (~3/4 of time of naïve)
It’s not just about going quicker
• We have a model that requires 10 GB of memory to model air flow in a city but the
resolution needs improving. Our current workstation has 2 processors and a total of
12 GB of memory. Our local HPC has many nodes identical to that in the workstation.
• If memory requirements are approximately proportional to resolution^3
1. How much memory would we require to double the resolution?
2. How many nodes will we require on the HPC cluster?
3. How much faster will we get the solution (presuming good scaling)?
• 10 * 2^3 = 80 GB
• 80/12 = 6.67 ==> 7 nodes required
• Faster? 7 nodes instead of 1, so 7x the available processing power,
BUT 8x more points to process, so it actually runs slower: about 8/7
times longer than before
To understand this, note that we were only using 10 GB (not the full 12 GB), so we have scaled to “just fit” the
HPC nodes. Requesting fewer nodes also means quicker turnaround when queueing for compute nodes on a busy
system. But if we instead scale up by asking for 8 HPC nodes, we would run at the same speed as before on an 8x larger problem
Challenges for
Parallelism
Fast is good… bigger is good…
BUT…
Thinking parallel
• Different aspects:
• Decomposition: how to decompose your problem into concurrent tasks;
• Scaling: how to ensure that there are enough concurrent tasks to keep all
the processor cores busy;
• Correctness: how to ensure that the result is free from errors;
• Abstractions and Patterns: how to choose and utilize appropriate algorithms and parallel patterns;
• ….
Correctness
• Correctness of sequential programs is not a trivial issue;
correctness of parallel programs even more so
• If a program is executed in parallel, i.e. using multiple
threads of control or processes running concurrently and
independently, then the precise order of [global] operations
can potentially differ from run to run
• That may lead to..
Correctness (cont.)
• That may lead to
– Non-deterministic behaviour
– Different results from run to run, not all of them may be correct;
• Round-off errors
– Failures in coordination of tasks
• Deadlocks
• Race conditions
Order may matter
• Presume machine that can only give 3 significant figures
• Find sum of
10 lots of 0.1
3 lots of 20
1 lot of 100
• 0.1 + 0.1 = 0.2
0.2 + 0.1 = 0.3
0.3 + 0.1 = 0.4
0.4 + 0.1 = 0.5
0.5 + 0.1 = 0.6
0.6 + 0.1 = 0.7
0.7 + 0.1 = 0.8
0.8 + 0.1 = 0.9
0.9 + 0.1 = 1.0
1.0 + 20 = 21
21 + 20 = 41
41 + 20 = 61
61 + 100 = 161
• 100 + 20 = 120
120 + 20 = 140
140 + 20 = 160
160 + 0.1 = 160
160 + 0.1 = 160
160 + 0.1 = 160
…
160 + 0.1 = 160
Same maths but 2 different answers
A trivial example, BUT “rounding error” is commonplace
and can be important
• Relevance to multi-core/multi-processor programming
• Dividing up the work to different processing elements (PEs)
• The order of operations is not always as clear as for data parallelism, e.g. a summation over different PEs
• e.g. the land/sea example: finding the global average temperature over 64 tiles
• needs the sum of the 64 local temperatures
Correctness (cont.)
• That may lead to
– Non-deterministic behaviour
– Different results from run to run, not all of them may be correct;
• Round-off errors
– Failures in coordination of tasks
• Race condition
• Deadlock & Livelock
Race conditions
• Race conditions occur when multiple tasks read from and
write to the same memory without proper synchronization;
• Depending on the application, this can lead to incorrect results
Example of race conditions
Shared variable x:
PE1: read x; x = x + 10; write x
PE2: read x; x = x + 10; write x
• What is “x” after both PEs have completed?
• Presumably, programmer’s “correct” answer is initial value + 20
for (int i=0; i<2; i++) { x = x + 10; }
• Actual result could be
– initial + 10
– initial + 20
Example of race conditions
PE1: read x; x = x + 10; write x
PE2: read x; x = x - 12; write x
Initial value of x is 44. Each thread updates a local copy of x, then each attempts to update the memory location. 2 possible outcomes:

Thread A      | Thread B     | Value of X in memory
Read X (44)   |              | 44
Add 10        | Read X (44)  | 44
Write X (54)  | subtract 12  | 54
              | Write X (32) | 32

Thread A      | Thread B                 | Value of X in memory
Read X (44)   |                          | 44
              | Read X (44), subtract 12 | 44
add 10        | Write X (32)             | 32
Write X (54)  |                          | 54

• When 2 or more PEs attempt to update the same memory location there can be a race condition, with the result being ill-defined (depending on which order of updates actually happens)
Deadlock & Livelock
• Deadlock occurs when at least two tasks wait for each other and neither will resume until some action is taken
– e.g. people meet in a corridor but cannot pass, and each waits for the other to move
• Livelock occurs when tasks involved in a deadlock take action to resolve the original deadlock, but in such a way that there is still/another deadlock (& repeat forever)… i.e. no tasks will ever be able to resume
– e.g. people meet in a corridor but cannot pass; each moves to the same side of the corridor, so they still cannot pass; both move to the opposite side, so they still cannot pass
Next time/s
– Locks / locking
– “atomics”
– Reductions
– Anatomy of HPC
• Programming Languages
– MPI (many nodes)
– OpenMP (single node)
– GPUs
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane