Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
7 – HAL
Files from Lecture Demos
• Login to Barkla & load the relevant modules
• Files available within my directory:
~mkbane/HPC_DEMOS
• You cannot edit/create files in my directory so you need to copy to your own directory
• Bear in mind that I will update HPC_DEMOS as we go along
• Bear in mind that you will not want to overwrite any edits of your own!
• rsync -ru -L -t --progress ~mkbane/HPC_DEMOS $HOME
this will recursively copy only newer files, resolving any “symbolic links”, maintaining
timestamps, and showing progress, from my HPC_DEMOS to an “HPC_DEMOS” directory
within your home directory
• Typically, for a given topic, I will refer to an “HPC_DEMOS/topic” subdirectory where there will
be a bunch of files and (ideally) a Makefile
RACE CONDITIONS
Consider the simple summation…
• (presuming we have no rounding error issues)
• If we want to sum up:
SUM = SUM + X(i)
for say i=0,…,9 where x(i) has previously been set to i
using 2 threads, with one thread doing the increments for even values of i and the other doing the increments for
odd values of i
• What is the answer?
• Let’s write an OpenMP program and find out…
• DEMO: HPC_DEMOS/raceCondition
– we will compile & run & compare serial vs. parallel (a minimal sketch of such a program follows below)
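A minimal sketch in the spirit of the demo (this is not the actual HPC_DEMOS/raceCondition code; the file name, variable names and the two-thread choice are illustrative). The shared variable sum is updated by both threads without any synchronisation, so the parallel result may differ from the serial answer of 45:

    /* race.c -- compile with e.g. gcc -fopenmp race.c */
    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x[10], sum = 0;
        for (int i = 0; i < 10; i++) x[i] = i;

        /* both threads read and write the shared 'sum' unprotected */
        #pragma omp parallel for num_threads(2)
        for (int i = 0; i < 10; i++) {
            sum = sum + x[i];          /* race: load, add, store are not atomic */
        }

        printf("parallel sum = %d (serial answer is 45)\n", sum);
        return 0;
    }

Running it several times may give different answers, which is exactly the symptom the following slides explain.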
Race Condition
• Very common “mistake”
• Occurs when multiple tasks read from and write to the same memory without
proper synchronization
• Depending on the application, can lead to incorrect results: let’s look at 2 parallel
processes running concurrently
[Diagram: two threads each execute sum = sum + X[i] against the same SUM location
in RAM (or L2 cache?). Both threads want to load the value in SUM, both threads
update their value for SUM, and both threads want to update the same physical
memory location.]
sum = sum + X[i]
• Presume this breaks down to
ld sum // from RAM to local register
inc sum by X[i]
str sum // back to RAM
• Also presume thread 0 does iterations i=0,2,4,6,8
and thread 1 does iterations i=1,3,5,7,9
• This is what we naively consider happening…
Thread 0              Value of SUM in RAM     Thread 1
                      at end of step
ld sum  [sum: 0]              0
inc sum [sum: 0]
str sum                       0
                                              ld sum  [sum: 0]
                                              inc sum [sum: 1]
                              1               str sum
sum = sum + X[i]
• Presume this breaks down to
ld sum // from RAM to local register
inc sum by X[i]
str sum // back to RAM
• Also presume thread 0 does iterations i=0,2,4,6,8
and thread 1 does iterations i=1,3,5,7,9
• But the loads and stores of the two threads may interleave:

Thread 0              Value of SUM in RAM     Thread 1
                      at end of step
ld sum  [sum: 0]              0               ld sum  [sum: 0]
inc sum [sum: 0]                              inc sum [sum: 1]
str sum                    0 or 1?            str sum
• Continuing with thread 0 doing iterations i=0,2 and thread 1 doing iterations i=1,3:

Thread 0              Value of SUM in RAM     Thread 1
                      at end of step
                              0               ld sum  [sum: 0]
ld sum  [sum: 0]
                              0               inc sum [sum: 1]
inc sum [sum: 0]
                              1               str sum
str sum                       0
                                              ld sum  [sum: 0]
ld sum  [sum: 0]
                              0               inc sum [sum: 3]
inc sum [sum: 2]
                              3               str sum
str sum                       2

• The correct answer for i=0,…,3 is 6, but this interleaving leaves SUM as 2 (another interleaving could leave 3)
Race Conditions
• Multiple threads or processes access one or more memory locations in an effectively “random” order,
such that the result is indeterminate
• Not just (e.g.) a SUM
• What if SUM is then used in an IF statement?
• The whole control flow, and thus the model prediction, could vary between runs
Concurrent access and Locks
• Locking is a mechanism for controlling concurrent access
• Acquiring a lock gives a process (or thread) exclusive
access to a data structure (e.g. a variable)
• Locks can be used to ensure the correct (required) behaviour of
multi-threaded programs (see the OpenMP sketch below)
• lock(sum)
ld sum
inc sum by X[i]
str sum
unlock(sum)
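A minimal sketch of this locking idea using OpenMP’s explicit lock routines (my own illustration, not the lecture’s demo code; variable names are illustrative):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x[10], sum = 0;
        for (int i = 0; i < 10; i++) x[i] = i;

        omp_lock_t sum_lock;
        omp_init_lock(&sum_lock);

        #pragma omp parallel for
        for (int i = 0; i < 10; i++) {
            omp_set_lock(&sum_lock);     /* acquire: only one thread at a time */
            sum = sum + x[i];            /* the ld / inc / str sequence is now protected */
            omp_unset_lock(&sum_lock);   /* release so the other threads can proceed */
        }

        omp_destroy_lock(&sum_lock);
        printf("sum = %d\n", sum);       /* always 45 */
        return 0;
    }

Every thread still performs its updates, but only the lock holder can be inside the protected ld/inc/str at any moment, so the stores can no longer overwrite each other.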
Thread 0              Value of SUM in RAM     Thread 1
                      at end of step
lock(sum)
ld sum  [sum: 0]              0
inc sum [sum: 0]
str sum                       0
unlock(sum)
                                              lock(sum)
                                              ld sum  [sum: 0]
                                              inc sum [sum: 1]
                              1               str sum
                                              unlock(sum)
lock(sum)

Thread 1 may try to obtain the lock, but it will have to
wait until it is released by thread 0.
There is no rule that threads take it fairly in
turns to obtain the lock.
Thread 0              SUM      Thread 1
                               lock(sum)
                       0       ld sum  [sum: 0]
                               inc sum [sum: 1]
                               str sum
                       1       unlock(sum)
lock(sum)
ld sum  [sum: 1]
inc sum [sum: 1]
str sum                1
unlock(sum)
                       1       lock(sum)
                               ld sum  [sum: 1]
                               inc sum [sum: 4]
                       4       str sum
                               unlock(sum)
lock(sum)
ld sum  [sum: 4]
inc sum [sum: 6]
str sum                6
unlock(sum)

(Here thread 0 does iterations i=0,2 and thread 1 does i=1,3, with thread 1 happening to
acquire the lock first; the final SUM of 6 is the correct answer for i=0,…,3.)
Parallel Patterns for Concurrency Access
• low level locks
– OpenMP: omp_set_lock(…), omp_unset_lock(…)
and others
– fine-grained control
– fairly easy to end up with deadlock (see the sketch below)
• high level functions that implement some form of locks
– routines for OpenMP and for MPI
– control access, but user does not have to worry about deadlock
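A sketch of how fine-grained locking can go wrong (an illustration of mine, not from the lecture): two threads acquire the same two locks in opposite orders, and if each grabs its first lock before the other releases, both wait forever.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        omp_lock_t lock_a, lock_b;
        omp_init_lock(&lock_a);
        omp_init_lock(&lock_b);

        #pragma omp parallel num_threads(2)
        {
            if (omp_get_thread_num() == 0) {
                omp_set_lock(&lock_a);
                omp_set_lock(&lock_b);   /* may block forever, waiting for thread 1 */
                omp_unset_lock(&lock_b);
                omp_unset_lock(&lock_a);
            } else {
                omp_set_lock(&lock_b);
                omp_set_lock(&lock_a);   /* may block forever, waiting for thread 0 */
                omp_unset_lock(&lock_a);
                omp_unset_lock(&lock_b);
            }
        }

        printf("finished (only printed if no deadlock occurred)\n");
        omp_destroy_lock(&lock_a);
        omp_destroy_lock(&lock_b);
        return 0;
    }

The high-level constructs on the next slide avoid this class of bug because the user never manages the locks directly.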
Parallel Patterns for Concurrency Access
• high level functions
– OpenMP atomic (a single memory update is performed by one thread at a time)
– OpenMP critical (region accessed by one thread at a time)
– OpenMP reduction (summation from all threads to master thread)
– MPI_Reduce (summation from all processes to root process)
• other [commutative & associative] maths operations also available
• There now follow 2 demos: one with “critical” and then
one with “atomic”, based upon the previous slide; a combined sketch is shown below.
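A combined sketch of the two approaches (assumed code, not the actual HPC_DEMOS demos): “critical” protects a whole region, “atomic” protects a single memory update.

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        int x[10], sum_critical = 0, sum_atomic = 0;
        for (int i = 0; i < 10; i++) x[i] = i;

        #pragma omp parallel for
        for (int i = 0; i < 10; i++) {
            #pragma omp critical
            {
                sum_critical += x[i];   /* only one thread inside this region at a time */
            }

            #pragma omp atomic
            sum_atomic += x[i];         /* just this one update is made atomic */
        }

        printf("critical: %d  atomic: %d\n", sum_critical, sum_atomic);  /* both 45 */
        return 0;
    }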
Integrating from x=1 to x=2 (a=1, b=2)
Estimate each area as a trapezoid:
1) Average height = 0.5 (f(x)+f(x+h))
2) Width is h
3) Area is height * width
Each trapezoid’s area is independent
Therefore we calculate them in parallel,
one on each of 6 parallel processes
The TOTAL integral
is the SUM of the areas of the 6 trapezoids
Collecting the results from the parallel
processes and adding them together
and handling access (implicit locking)
is a “reduction” pattern
SUM = sum[0] + sum[1] + … sum[5]
[Figure: the interval from a to b split into 6 trapezoids, numbered 0–5, distributed
across parallel processes; the partial sums are then combined on the master or root process]
REDUCTION (detailed)
[Figure: the distributed partial sums feed an operator OP producing X; whether OP
receives its operands as “A op B” or “B op A” is not determined]
For
X = A op B
A and B come from different threads in an
indeterministic manner
We don’t know in which order the
distributed elements will be
combined
Hence op needs to be
commutative & associative
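A minimal sketch of the trapezoidal integral above using an OpenMP reduction (the integrand f(x) = 1/x is my illustrative choice; the slide does not fix f):

    #include <stdio.h>
    #include <omp.h>

    double f(double x) { return 1.0 / x; }   /* example integrand */

    int main(void) {
        const double a = 1.0, b = 2.0;
        const int n = 6;                     /* 6 trapezoids, as in the figure */
        const double h = (b - a) / n;
        double sum = 0.0;

        /* each thread accumulates a private copy of 'sum'; OpenMP combines
           the copies at the end of the loop in an unspecified order */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++) {
            double x = a + i * h;
            sum += 0.5 * (f(x) + f(x + h)) * h;   /* average height * width */
        }

        printf("integral approx = %f\n", sum);
        return 0;
    }

Because the combination order is unspecified, the “+” operator must be commutative & associative, exactly as argued above.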
Parallel Patterns for Concurrency Access
• high level functions
– OpenMP atomic (a single memory update is performed by one thread at a time)
– OpenMP critical (region accessed by one thread at a time)
– OpenMP reduction (summation from all threads to master thread)
– MPI_Reduce (summation from all processes to root process)
• other [commutative & associative] maths operations also available
• Q: which operations would be supported (as a “reduction”)?
Example of “reduction”
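For example, a sketch of the same trapezoidal integral with MPI_Reduce (again an assumed illustration rather than the lecture’s demo code), with one trapezoid per process and the partial areas summed onto the root process:

    /* run with e.g. mpirun -np 6 ./a.out */
    #include <stdio.h>
    #include <mpi.h>

    double f(double x) { return 1.0 / x; }   /* example integrand */

    int main(int argc, char *argv[]) {
        MPI_Init(&argc, &argv);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        const double a = 1.0, b = 2.0;
        const double h = (b - a) / nprocs;   /* one trapezoid per process */

        double x = a + rank * h;
        double my_area = 0.5 * (f(x) + f(x + h)) * h;

        double total = 0.0;
        /* MPI_SUM is commutative & associative; the root (rank 0) receives the result */
        MPI_Reduce(&my_area, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0) printf("integral approx = %f\n", total);

        MPI_Finalize();
        return 0;
    }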
Conclusion
• All the aspects of the Thinking Parallel approach discussed so far are
non-trivial
• We need tools and instruments that support Thinking
Parallel at an appropriate level of abstraction
ANATOMY OF HPC
Physicals Matter
• On a single node
• All cores can access the RAM
• SHARED memory
• (may have local copies in their
caches)
• For SHARED memory we can use
• OpenMP
• pThreads
• MPI
• But HPC is lots of nodes
• Connected by (fast) interconnect
• Memory per node => DISTRIBUTED
memory
• Needs a mechanism for a node to
interact with data on another node
• Typically:
• MPI
• “Message Passing Interface”
SHARED MEMORY
• Memory on node
• Faster access
• Limited to that memory
• … and to those cores
• Programming typically OpenMP (or another
threaded model)
• Directives based
• Incremental changes
• Portable to single core / non-OpenMP
• Single code base
• Can use MPI too
DISTRIBUTED MEMORY
• Access memory of another node
• Latency & bandwidth issues
• InfiniBand (IB) vs. gigabit Ethernet (gigE)
• Expandable (memory & nodes)
• Programming is almost always (99%) MPI
• Message Passing Interface
• Library calls
• More intrusive
• Different implementations of MPI standard
• Non-portable to non-MPI (without effort)
Parallel hardware
• Flynn’s taxonomy:
– SISD (Single Instruction, Single Data)
– SIMD (Single Instruction, Multiple Data)
– MISD eg where fault tolerance is vital
– MIMD (Multiple Instruction, Multiple Data)
Flynn, Michael J. (September 1972).
“Some Computer Organizations and Their Effectiveness”.
IEEE Transactions on Computers. C-21 (9): 948–960. doi:10.1109/TC.1972.5009071.
https://en.wikipedia.org/wiki/Flynn%27s_taxonomy
Parallel hardware
• Not for this course
– SISD (Single Instruction, Single Data): serial
– MISD eg where fault tolerance is vital
• Of interest
– SIMD (Single Instruction, Multiple Data)
• Vectorisation, some OpenMP (see the sketch after this list)
• (SIMT: single instruction, multiple threads) = SIMD + multithreading
– Particularly for GPUs
– MIMD (Multiple Instruction, Multiple Data)
• Options: shared-memory || distributed-memory || hybrid
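A small sketch of SIMD-style vectorisation with the OpenMP simd directive (illustrative data and loop, not from the course materials):

    #include <stdio.h>

    int main(void) {
        double x[1000], y[1000], dot = 0.0;
        for (int i = 0; i < 1000; i++) { x[i] = i; y[i] = 2.0 * i; }

        /* one instruction operates on several array elements at once */
        #pragma omp simd reduction(+:dot)
        for (int i = 0; i < 1000; i++)
            dot += x[i] * y[i];

        printf("dot = %f\n", dot);
        return 0;
    }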
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane