COMP528 HAL21: OpenMP Synchronisation

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

21 – HAL

RECAP

SHARED MEMORY

• Memory on chip
– Faster access

– Limited to that memory

– … and to that node

• Programming typically OpenMP
– Directives based + environment vars

+ run time functions

– Incremental changes to code
(e.g. loop by loop)

– Portable to single core / non-OpenMP
• Single code base (or use of “stubs”)

DISTRIBUTED MEMORY

• Access memory of another node
– Latency & bandwidth issues

– Which interconnect: IB .v. gigE (v. etc)

– Expandable (memory & nodes)

• Programming: almost always (99%) MPI
– Message Passing Interface

– Library calls => more intrusive

– Different MPI libs / implementations

– Non-portable to non-MPI (without effort)


OpenMP

• Only for “globally addressable” shared
memory
– (generally) only a single node

• Threads

• Fork-join model: parallel regions
– Work-sharing

– Tasks

• Need to think about
private(x) .v. shared(x)

MPI

• For distributed memory
– includes the special case of a single node

• Processes

• Each process on a different core

• Need to think message-passing in
order to share information

                              OpenMP                                  MPI
Intel modules on Barkla       intel                                   intel, intel-mpi
compile (with no optimisation) icc -qopenmp -O0 myOMP.c -o myOMP.exe  mpiicc -O0 myMPI.c -o myMPI.exe
run on 7 cores                export OMP_NUM_THREADS=7                mpirun -np 7 ./myMPI.exe
                              ./myOMP.exe


OPENMP: PRIVATE OR SHARED

What is X – private or shared?

It’s all about whether it’s

• safe for all threads to access the same mem loc
=> shared (accessible to all threads)

• unsafe, so each thread needs its own copy
=> private (aka local to each thread)

• There are occasional OpenMP standard specifics about what can be
shared/private (eg for structs, elements of [dynamic] arrays etc)
– see the standard if your code does not compile!

shared(X)

• safe for all threads to access the same mem loc => shared

• Only ever read
– shared(X)

– good news: less overhead

• Array X where each thread only reads/writes to
a given X[i] element
– shared(X)

– less overhead but…

– possible danger of “false sharing” if consecutive elements of X are
being written and read at same time

private(X)

• unsafe for all threads to access the same mem loc,
so each thread needs its own copy
=> private

• X is a “temporary” variable (eg within a loop)
– it is assigned at start of each loop iteration

(e.g. from other variables’ values)

– it is used only within a single loop iteration (so no loop-carried dependency)

– each thread is updating X with different values

– could have declared X within the loop itself

– private(X)
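To make the two cases concrete, here is a minimal sketch (not from the slides; variable names are illustrative): a read-only shared variable, a shared array written one element per iteration, and a per-iteration temporary made private.

#include <stdio.h>

int main(void) {
    int N = 8;
    double a[8], scale = 2.0, x;

    #pragma omp parallel for default(none) shared(N, a, scale) private(x)
    for (int i = 0; i < N; i++) {
        x = scale * i;   // temporary, assigned afresh each iteration => private
        a[i] = x * x;    // each thread writes only its own a[i] => shared is safe
    }                    // (scale is only ever read => shared is safe too)

    for (int i = 0; i < N; i++)
        printf("a[%d] = %f\n", i, a[i]);
    return 0;
}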

Be careful… the above are generalisations

1. Some SYNCHRONISATION constructs follow
– these can be used to “hide”/“control” parallel updates, so you can have
shared(X) where updates to X are protected

2. shared(X) on #pragma omp parallel
can then nest a private(X) on an inner #pragma omp for
(see the sketch below)

hence need to think carefully…
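A minimal sketch of point 2 (hypothetical names): x is shared across the parallel region, but the inner work-sharing loop gives each thread its own private copy, leaving the shared x untouched.

#include <stdio.h>
#include <omp.h>

int main(void) {
    double x = -1.0;                  // shared in the parallel region
    #pragma omp parallel default(none) shared(x)
    {
        #pragma omp for private(x)    // re-scoped: each thread has its own x here
        for (int i = 0; i < 8; i++) {
            x = 0.5 * i;              // no race: writes go to the private copy
            printf("thread %d: x = %f\n", omp_get_thread_num(), x);
        }
    }
    printf("after region: x = %f\n", x);   // still -1.0: private copies were discarded
    return 0;
}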

OPENMP SYNCHRONISATION

Why Sync?

• Coarse granularity is better than fine granularity
– but may wish some work only on one thread (not replicated)

• e.g. output of a global sum

• UNLIKE the distributed memory programming with MPI,
– with OpenMP, many threads can access same memory location

• shared memory, shared(a1, a2, etc)

• may need capability for ensuring some order in accesses
– to prevent race conditions

• May also wish to synchronise e.g. threads writing to a file

Coarse Grained but NOT Replicated

• Sometimes…
– only want one thread to

execute part of a region

– or just the “master” thread to
do something

– or all threads but only one at
a time

#pragma omp critical [name]
#pragma omp single
#pragma omp master


Work on just one thread

#pragma omp single [data clauses]
{
   block
}

– only a single thread will execute the block

– implementation dependent how the choice is made

– implicit synchronisation at the end
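A minimal sketch (hypothetical variable names): one thread initialises a shared value, and the implicit barrier at the end of single means every thread sees it afterwards.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int n = 0;
    #pragma omp parallel default(none) shared(n)
    {
        #pragma omp single
        {
            n = 100;   // executed by exactly one (implementation-chosen) thread
            printf("single executed by thread %d\n", omp_get_thread_num());
        }
        // implicit barrier here: all threads now see n == 100
        printf("thread %d sees n = %d\n", omp_get_thread_num(), n);
    }
    return 0;
}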

Work on just one thread

#pragma omp master
{
   block
}

– only the master thread (#0) will execute the block

– lower overhead than omp single (WHY?)

– no implicit synchronisations
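A minimal sketch (hypothetical names), which also answers the WHY: master is cheaper because there is no thread selection to coordinate and no barrier; the cost is that you must synchronise yourself if other threads depend on the result.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int flag = 0;
    #pragma omp parallel default(none) shared(flag)
    {
        #pragma omp master
        flag = 1;                 // always thread #0; no choice of thread needed

        #pragma omp barrier       // master has NO implicit barrier, so add one
        printf("thread %d sees flag = %d\n", omp_get_thread_num(), flag);
    }
    return 0;
}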

Controlled Access to a Block of Code

• High level: omp critical (Lower level: omp locks)
#pragma omp critical [name]
{
   block
}

• only a single thread at a time will execute the block

• can have more than one critical region in a code
– danger of deadlock:

– un-named critical regions have the same system-defined name

– and the restriction on executing the block applies to all same named critical regions

• can name a critical region

• can also give implementation-dependent hint


“But I don’t understand…”

what we gonna do?

1. go to the standard (2.13.2)

“but I still don’t get it”

what we gonna do?

2. look at the examples (sect 6.1)

#pragma omp critical

• If you only have one in your code, no need to name

• If you have more, and they are meant to be independent,
then give them different names
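A minimal sketch of that advice (hypothetical names): two logically independent updates are given different critical names, so a thread inside one does not block threads entering the other.

#include <stdio.h>
#include <omp.h>

int main(void) {
    int sum = 0, count = 0;
    #pragma omp parallel default(none) shared(sum, count)
    {
        #pragma omp critical (update_sum)
        sum += omp_get_thread_num();

        #pragma omp critical (update_count)   // different name => independent
        count++;
    }
    printf("sum = %d, count = %d\n", sum, count);
    return 0;
}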

Controlled Access to a Block of Code

• recap: with omp critical, only a single thread at a time will execute the block

• but what about finer grained control, eg controlled access to a single variable?


Controlled Access to a Variable

• High level: omp atomic (Lower level: omp locks)
#pragma omp atomic […]
expression-statement-and-nothing-more;

• NOT for a block, JUST for a single statement

• protects a specific memory location (not a block of code)

• many threads may update the memory location, but only in an
“atomic” manner

Simple Atomic

x = numberInStock;

#pragma omp parallel default(none) shared(x) \
                     private(y) num_threads(numOutlets)
{
  y = workOutNumberSold(omp_get_thread_num());

  #pragma omp atomic
  x = x - y;

} // omp par

numberToOrder = estimatedBestStock(tomorrow) - x;

Less Simple Atomic

x = numberInStock;

#pragma omp parallel default(none) shared(x) \
                     private(y) num_threads(numOutlets)
{
  // y = workOutNumberSold(omp_get_thread_num());

  #pragma omp atomic
  x = x - workOutNumberSold(omp_get_thread_num());

} // omp par

numberToOrder = estimatedBestStock(tomorrow) - x;

– atomic protects only the update of x, so the function evals will
run concurrently

• compare with a critical region instead:

x = numberInStock;

#pragma omp parallel default(none) shared(x) \
                     private(y) num_threads(numOutlets)
{
  // y = workOutNumberSold(omp_get_thread_num());

  #pragma omp critical
  x = x - workOutNumberSold(omp_get_thread_num());

} // omp par

numberToOrder = estimatedBestStock(tomorrow) - x;

– critical protects the whole statement, so each statement is done
serially (the function evals no longer overlap)


• what OpenMP directive for synchronisation have we seen
already?

• similar to one in MPI (MPI_Barrier)

• used in timing, for example (to ensure all threads start at the
same point, so we are not accidentally measuring load imbalance
from earlier in the code’s execution)

#pragma omp barrier

#pragma omp barrier

• a point where no thread in the team can progress until all
threads encounter the barrier

• could be useful to enforce some ordering

• could be useful for timing only what you want to time
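A minimal sketch of the timing use (setup and compute are hypothetical stand-ins for real work):

#include <stdio.h>
#include <omp.h>

void setup(int tid)   { (void)tid; /* imagine unevenly distributed work here */ }
void compute(int tid) { (void)tid; /* the work we actually want to time */ }

int main(void) {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();
        setup(tid);                   // threads may finish this at different times

        #pragma omp barrier           // don't let that imbalance leak into the timing
        double t0 = omp_get_wtime();  // every thread starts its clock together

        compute(tid);

        #pragma omp barrier           // wait for the slowest thread
        if (tid == 0)
            printf("compute took %f s\n", omp_get_wtime() - t0);
    }
    return 0;
}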

Synchronisation Constructs

• Some more we cover for tasks &/or accelerators

• Some we won’t cover at all

• Summary (so far) of synchronisation constructs:
#pragma omp master

#pragma omp critical

#pragma omp atomic

#pragma omp barrier

Summary: work-sharing constructs

– omp for

– omp single

– omp sections

– omp workshare (Fortran only; the first three apply in C)

• have implicit barrier at exit

• must be encountered by all threads of the team

• the order in which work-sharing constructs and barriers are
encountered must be the same for all threads of the team


nowait

• the implicit barrier at the exit of a work-sharing construct can be
removed with the nowait clause (sketched below)
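A minimal sketch (loop bodies are illustrative): threads finishing the first loop early move straight into the second; this is only safe because the two loops are independent.

#include <stdio.h>
#include <math.h>

#define N 1000

int main(void) {
    double a[N], b[N];
    #pragma omp parallel
    {
        #pragma omp for nowait     // drop the implicit barrier at this loop's exit
        for (int i = 0; i < N; i++)
            a[i] = sin((double)i);

        #pragma omp for            // safe only because b[] is independent of a[]
        for (int i = 0; i < N; i++)
            b[i] = cos((double)i);
    }
    printf("a[1] = %f, b[1] = %f\n", a[1], b[1]);
    return 0;
}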

• parallel regions, #threads, data clauses

• work-sharing constructs
• scheduling, nowait

• synchronisation

• NOT COVERED: memory model

• NEXT ==> OVERVIEWS re Tasks, Accelerators

Questions via MS Teams / email
Dr Michael K Bane, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane