COMP528 HAL21 – OpenMP synchronisation
Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
21 – HAL
RECAP
SHARED MEMORY
• Memory on chip
– Faster access
– Limited to that memory
– … and to those nodes
• Programming typically OpenMP
– Directives based + environment vars
+ run time functions
– Incremental changes to code
(e.g. loop by loop)
– Portable to single core / non-OpenMP
• Single code base (or use of “stubs”)
DISTRIBUTED MEMORY
• Access memory of another node
– Latency & bandwidth issues
– Which interconnect: InfiniBand (IB) vs gigabit Ethernet (gigE), etc.
– Expandable (memory & nodes)
• Programming: 99% of the time MPI
– Message Passing Interface
– Library calls more intrusive
– Different MPI libs / implementations
– Non-portable to non-MPI (without effort)
COMP328/COMP528 (c) mkbane, university of
liverpool
OpenMP
• Only for “globally addressable” shared
memory
– (generally) only a single node
• Threads
• Fork-join model: parallel regions
– Work-sharing
– Tasks
• Need to think about
private(x) vs shared(x)
MPI
• For distributed memory
– Includes subset of a single node
• Processes
• Each process on a different core
• Need to think in terms of message-passing in
order to share information
OpenMP:
– Intel modules on Barkla: intel
– compile (with no optimisation): icc -qopenmp -O0 myOMP.c -o myOMP.exe
– run on 7 cores: export OMP_NUM_THREADS=7
                  ./myOMP.exe
MPI:
– Intel modules on Barkla: intel intel-mpi
– compile (with no optimisation): mpiicc -O0 myMPI.c -o myMPI.exe
– run on 7 cores: mpirun -np 7 ./myMPI.exe
OPENMP: PRIVATE OR SHARED
What is X – private or shared?
It’s all about whether it is
• safe for all threads to access the same mem loc
=> shared (accessible to all threads)
• unsafe, so each thread needs its own copy
=> private (i.e. local to each thread)
• The OpenMP standard has occasional specifics about what can be
shared/private (e.g. for structs, elements of [dynamic] arrays, etc.)
– see the standard if your code does not compile!
shared(X)
safe for all threads to access the same mem loc => shared
• Only ever read
– shared(X)
– good news: less overhead
• Array X where each thread only reads/writes to
a given X[i] element
– shared(X)
– less overhead but…
– possible danger of “false sharing” if consecutive elements of X are
being written and read at the same time
private(X)
unsafe for all threads to access the same mem loc,
so each thread needs own copy
=> private
• X is a “temporary” variable (eg within a loop)
– it is assigned at start of each loop iteration
(e.g. from other variables’ values)
– it is used only within a single loop iteration (so no loop-carried dependency)
– each thread is updating X with different values
– could have declared X within the loop itself
– private(X)
Be careful… the above are generalisations:
1. Some SYNCHRONISATION constructs (to follow) can be used to
“hide”/“control” parallel updates, so you can have
shared(X) where updates to X are protected
2. shared(X) for #pragma omp parallel
can then nest a private(X) for #pragma omp for
hence the need to think carefully…
OPENMP SYNCHRONISATION
Why Sync?
• Coarse granularity is better than fine granularity
– but may wish some work only on one thread (not replicated)
• e.g. output of a global sum
• UNLIKE the distributed memory programming with MPI,
– with OpenMP, many threads can access same memory location
• shared memory, shared(a1, a2, etc)
• may need capability for ensuring some order in accesses
– to prevent race conditions
• May also wish to synchronise e.g. threads writing to a file
Coarse Grained but NOT Replicated
• Sometimes…
– only want one thread to
execute part of a region
– or just the “master” thread to
do something
– or all threads but only one at
a time
#pragma omp critical [name]
#pragma omp single
#pragma omp master
Work on just one thread
#pragma omp single [data clauses]
{
block
}
– only a single thread will execute the block
– implementation-dependent which thread is chosen
– implicit synchronisation at the end
Work on just one thread
#pragma omp master
{
block
}
– only the master thread (#0) will execute the block
– lower overhead than omp single (WHY?)
– no implicit synchronisations
Controlled Access to a Block of Code
• High level: omp critical (Lower level: omp locks)
#pragma omp critical [name]
{
block
}
• only a single thread at a time will execute the block
• can have more than one critical region in a code
– danger of deadlock:
– un-named critical regions have the same system-defined name
– and the restriction on executing the block applies to all same named critical regions
• can name a critical region
• can also give implementation-dependent hint
“But I don’t understand…”
what we gonna do?
1. go to the standard (2.13.2)
“but I still don’t get it”
what we gonna do?
2. look at the examples (sect 6.1)
#pragma omp critical
• If you only have one in your code, no need to name
• If you have more, and they are meant to be independent,
then give them different names
Controlled Access to a Block of Code (recap)
• High level: omp critical – only a single thread at a time will execute the block
• but what about finer grained, e.g. controlled access to a variable?
Controlled Access to a Variable
• High level: omp atomic (Lower level: omp locks)
#pragma omp atomic […]
expression-statement-and-nothing-more;
• NOT for a block, JUST for a single statement
• protects a specific memory location (not a block of code)
• many threads may update the memory location, but only in an
“atomic” manner
Simple Atomic
x = numberInStock;
#pragma omp parallel default(none) shared(x) \
        private(y) num_threads(numOutlets)
{
  y = workOutNumberSold(omp_get_thread_num());
  #pragma omp atomic
  x = x - y;
} // omp par
numberToOrder = estimatedBestStock(tomorrow) - x;
Less Simple Atomic
x = numberInStock;
#pragma omp parallel default(none) shared(x) \
        private(y) num_threads(numOutlets)
{
  // y = workOutNumberSold(omp_get_thread_num());
  #pragma omp atomic
  x = x - workOutNumberSold(omp_get_thread_num());
} // omp par
numberToOrder = estimatedBestStock(tomorrow) - x;
(function evals will run concurrently)
x = numberInStock;
#pragma omp parallel default(none) shared(x) \
        private(y) num_threads(numOutlets)
{
  // y = workOutNumberSold(omp_get_thread_num());
  #pragma omp critical
  x = x - workOutNumberSold(omp_get_thread_num());
} // omp par
numberToOrder = estimatedBestStock(tomorrow) - x;
Less Simple Atomic: comparison
• with omp atomic: the function evals run concurrently; only the update of x is serialised
• with omp critical: each statement is done serially (the function call is inside the protected region)
• what OpenMP directive for synchronisation have we seen
already?
• similar to MPI_Barrier in MPI
• used in timing, for example: to ensure all threads start at the
same point, so we are not accidentally measuring load imbalance
from earlier in the code execution
#pragma omp barrier
• a point where no thread in the team can progress until all
threads encounter the barrier
• could be useful to enforce some ordering
• could be useful for timing only what you want to time
Synchronisation Constructs
• Some more we cover for tasks &/or accelerators
• Some we won’t cover at all
• Summary (so far) of synchronisation constructs:
#pragma omp master
#pragma omp critical
#pragma omp atomic
#pragma omp barrier
Summary: work-sharing constructs
– omp for
– omp single
– omp sections
– omp workshare (Fortran only)
• have implicit barrier at exit
• must be encountered by all threads of team
• order that work-sharing and barriers encountered must be
same for all threads of team
Summary: work-sharing constructs (for C)
– omp for
– omp single
– omp sections
• have implicit barrier at exit
• must be encountered by all threads of team
• order that work-sharing and barriers encountered must be
same for all threads of team
nowait clause: removes the implicit barrier at the end of a work-sharing construct
• parallel regions, #threads, data clauses
• work-sharing constructs
• scheduling, nowait
• synchronisation
• NOT COVERED: memory model
• NEXT ==> OVERVIEWS re Tasks, Accelerators
Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane