OpenMP 4 – What’s New?
SciNet Developer Seminar
Ramses van Zon
September 25, 2013
Intro to OpenMP
• For shared-memory systems.
• Adds parallelism to functioning serial code.
• For C, C++ and Fortran.
• http://openmp.org
• Compiler/run-time does a lot of work for you.
• Divides up work.
• You tell it how to use variables, and what to parallelize.
• Works by adding compiler directives to code.
Quick Example – C
/* example1.c */
int main()
{
    int i,sum;
    sum=0;
    for (i=0; i<101; i++)
        sum+=i;
    return sum-5050;
}
> $CC example1.c
> ./a.out
⇒
/* example1.c */
int main()
{
    int i,sum;
    sum=0;
    #pragma omp parallel
    #pragma omp for reduction(+:sum)
    for (i=0; i<101; i++)
        sum+=i;
    return sum-5050;
}
> $CC example1.c -fopenmp
> export OMP_NUM_THREADS=8
> ./a.out
Quick Example – Fortran
program example1
    integer i,sum
    sum=0
    do i=1,100
        sum=sum+i
    end do
    print *, sum-5050
end program example1
> $FC example1.f90
⇒
program example1
    integer i,sum
    sum=0
    !$omp parallel
    !$omp do reduction(+:sum)
    do i=1,100
        sum=sum+i
    end do
    !$omp end parallel
    print *, sum-5050
end program example1
> $FC example1.f90 -fopenmp
Memory Model in OpenMP (3.1)
Execution Model in OpenMP
Execution Model in OpenMP with Tasks
Existing Features (OpenMP 3.1)
1. Create threads with shared and private memory;
2. Parallel sections and loops;
3. Different work scheduling algorithms for load balancing loops;
4. Lock, critical and atomic operations to avoid race conditions;
5. Combining results from different threads;
6. Nested parallelism;
7. Generating tasks to be executed by threads.
Supported by GCC, Intel, PGI and IBM XL compilers.
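As a small illustration of items 3 and 4, here is a minimal sketch that combines a dynamic loop schedule with an atomic update. The function names (is_prime, count_primes) and the chunk size of 64 are illustrative assumptions, not taken from the talk.

```c
/* Hypothetical helper: returns 1 if n is prime, 0 otherwise.
   The cost grows with n, so loop iterations are unevenly expensive. */
static int is_prime(int n)
{
    if (n < 2) return 0;
    for (int d = 2; (long)d * d <= n; d++)
        if (n % d == 0) return 0;
    return 1;
}

/* Counts the primes below limit. */
int count_primes(int limit)
{
    int count = 0;
    /* dynamic scheduling hands out chunks of 64 iterations on demand,
       which balances the uneven work across threads */
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < limit; i++) {
        if (is_prime(i)) {
            /* atomic protects the shared counter from a race condition */
            #pragma omp atomic
            count++;
        }
    }
    return count;
}
```

In practice a reduction(+:count) clause would usually be cheaper than an atomic update per hit; the atomic is used here only to show the construct.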
Introducing OpenMP 4.0
• Released July 2013, OpenMP 4.0 is an API specification.
• As usual with standards, it is a mix of features that are commonly implemented in another form and ones that have never been implemented.
• As a result, compiler support varies. E.g., Intel compilers v. 14.0 are good at offloading to the Xeon Phi, while GCC has more task support.
• OpenMP 4.0 is a 248-page document (without appendices); OpenMP 1 for C/C++ or Fortran was ≈ 40 pages.
• No examples in this specification, no summary card either.
• But it has a lot of new features...
New Features in OpenMP 4.0
1. Support for compute devices
2. SIMD constructs
3. Task enhancements
4. Thread affinity
5. Other improvements
1. Support for Compute Devices
• Effort to support a wide variety of compute devices: GPUs, Xeon Phis, clusters(?)
• OpenMP 4.0 adds mechanisms to describe regions of code where data and/or computation should be moved to another computing device.
• Moves away from shared memory per se.
• omp target.
Memory Model in OpenMP 4.0
• A device has its own data environment
• and its own shared memory.
• Threads can be bundled into teams of threads.
• Threads of the same team can have memory shared among them.
• Whether this is beneficial depends on the memory architecture of the device (team ≈ CUDA thread block, MPI_COMM?).
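A minimal sketch of how teams of threads might be used on a device, assuming a compiler with OpenMP 4.0 device support. The function name team_sum and the choice of 4 teams are illustrative assumptions. If no device is available, implementations typically run the target region on the host.

```c
/* Sums x[0..n-1] (n > 0) using a league of teams on the target device. */
double team_sum(const double *x, int n)
{
    double sum = 0.0;

    /* copy x to the device; sum travels both ways */
    #pragma omp target map(to: x[0:n]) map(tofrom: sum)
    /* create (up to) 4 teams; each team keeps its own partial sum */
    #pragma omp teams num_teams(4) reduction(+:sum)
    /* split the iterations over the teams, then over the threads of each team */
    #pragma omp distribute parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += x[i];

    return sum;
}
```

The two reduction clauses work at different levels: the inner one combines the threads within a team, the outer one combines the teams.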
Data mapping
• Host memory and device memory are usually distinct.
• OpenMP 4.0 also allows host and device memory to be shared.
• To accommodate both, the relation between variables on the host and on the device is expressed as a mapping.
Different map types:
• to: existing host variables are copied to a corresponding variable on the target before the target region runs.
• from: target variables are copied back to a corresponding variable on the host after the target region ends.
• tofrom: both to and from.
• alloc: neither to nor from; ensures the variable exists on the target, but no data is copied from or to the host variable.
Note: arrays and array sections are supported.
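The four map types can be sketched in a single target region as follows. This is a minimal sketch under the assumption of an OpenMP 4.0 compiler; the function name map_demo and its parameters are hypothetical, and array sections use the [start:length] notation.

```c
/* Computes y = a*x + y and out = y*y element-wise (n > 0).
   tmp is scratch space of length n. */
void map_demo(int n, double a, const double *x, double *y,
              double *out, double *tmp)
{
    #pragma omp target map(to: x[0:n])     /* copied host -> device before */ \
                       map(tofrom: y[0:n]) /* copied in before, out after  */ \
                       map(from: out[0:n]) /* copied device -> host after  */ \
                       map(alloc: tmp[0:n])/* exists on device, no copies  */
    {
        for (int i = 0; i < n; i++) {
            tmp[i] = a * x[i];
            y[i]  += tmp[i];
            out[i] = y[i] * y[i];
        }
    }
}
```

Because tmp is mapped with alloc, its host contents after the call should not be relied upon when the region really runs on a separate device.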
OpenMP Device Example using target
/* example2.c */
#include <stdio.h>
#include <omp.h>
int main()
{
    int host_threads, trgt_threads;
    host_threads = omp_get_max_threads();
    /* run on the target device; copy trgt_threads back to the host */
    #pragma omp target map(from:trgt_threads)
    trgt_threads = omp_get_max_threads();
    printf("host_threads = %d\n", host_threads);
    printf("trgt_threads = %d\n", trgt_threads);
}
> $CC -fopenmp example2.c -o example2
> ./example2
host_threads = 16
trgt_threads = 224
OpenMP Device Example using target
program example2
    use omp_lib
    integer host_threads, trgt_threads
    host_threads = omp_get_max_threads()
    !$omp target map(from:trgt_threads)
    trgt_threads = omp_get_max_threads()
    !$omp end target
    print *, "host_threads =", host_threads
    print *, "trgt_threads =", trgt_threads
end program example2
> $FC -fopenmp example2.f90 -o example2
> ./example2
host_threads = 16
trgt_threads = 224
OpenMP Device Example using teams, distribute
#include <stdio.h>
#include <omp.h>
int main()
{
int ntprocs;
#pragma omp target map(from:ntprocs)
ntprocs = omp_get_num_procs();
int ncases=2240, nteams=4, chunk=ntprocs*2;
#pragma omp target
#pragma omp teams num_teams(nteams) thread_limit(ntprocs/nteams)
#pragma omp distribute
for (int starti=0; starti
#define N 262144
int main()
{
long long d1=0;
double a[N], b[N], c[N], d2=0.0;
#pragma omp simd reduction(+:d1)
for (int i=0;i
#pragma omp declare simd
double computeb(int i)
{ return N+1-i; }
#define N 262144
int main()
{
long long d1=0;
double a[N], b[N], c[N], d2=0.0;
#pragma omp simd reduction(+:d1)
for (int i=0;i
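A minimal, self-contained sketch of the two SIMD constructs that appear in these examples: omp declare simd on a helper function and omp simd with a reduction on a loop. The names weight and weighted_sum are hypothetical and chosen for illustration; whether the loop is actually vectorized is up to the compiler.

```c
/* declare simd asks the compiler to also generate a vector version
   of this function, so it can be called from a SIMD loop. */
#pragma omp declare simd
static double weight(int i, int n)
{
    return n + 1 - i;
}

/* Returns the sum over i of a[i] * weight(i, n). */
double weighted_sum(const double *a, int n)
{
    double d = 0.0;
    /* simd lets the compiler vectorize the loop; the reduction clause
       gives each SIMD lane its own partial sum of d */
    #pragma omp simd reduction(+:d)
    for (int i = 0; i < n; i++)
        d += a[i] * weight(i, n);
    return d;
}
```

Note that omp simd by itself does not create threads; it can be combined with a worksharing loop when both threads and vector lanes are wanted.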
#define N 40
int main()
{
char haystack[N+1]="abcabcabczabcabcabcxabcabcabczabcabcabcz";
char needle='x';
int pos;
#pragma omp parallel for
for (int i=0; i
> export OMP_NUM_THREADS=16
> export OMP_PLACES=0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15
> export OMP_PROC_BIND=spread,close
> ./example
…
5. Other improvements
• User-defined reductions:
  Previously, the OpenMP API only supported reductions with base language operators and intrinsic procedures. With the OpenMP 4.0 API, user-defined reductions are now also supported.
  omp declare reduction
• Sequentially consistent atomics:
  A clause has been added to allow a programmer to enforce sequential consistency when a specific storage location is accessed atomically.
  omp atomic seq_cst
• Optionally display all internal variables at program start:
  OMP_DISPLAY_ENV=TRUE|FALSE|VERBOSE
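A user-defined reduction might look like the following minimal sketch, which finds the largest element of an array together with its position. The type MaxLoc, the reduction identifier maxloc and the function find_max are hypothetical names introduced for illustration.

```c
#include <float.h>

/* Value and position of the largest element seen so far. */
typedef struct { double value; int index; } MaxLoc;

/* Combiner: keep whichever partial result holds the larger value.
   Initializer: each thread's private copy starts below any real value. */
#pragma omp declare reduction(maxloc : MaxLoc : \
        omp_out = (omp_in.value > omp_out.value) ? omp_in : omp_out) \
        initializer(omp_priv = { -DBL_MAX, -1 })

/* Returns the maximum of x[0..n-1] and its index (index -1 if n == 0). */
MaxLoc find_max(const double *x, int n)
{
    MaxLoc best = { -DBL_MAX, -1 };
    #pragma omp parallel for reduction(maxloc : best)
    for (int i = 0; i < n; i++) {
        if (x[i] > best.value) {
            best.value = x[i];
            best.index = i;
        }
    }
    return best;
}
```

If the maximum occurs more than once, which index is returned depends on the order in which partial results are combined, so this sketch is only deterministic for arrays with a unique maximum.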
Thank you for your attention.
Have fun exploring!
http://openmp.org/wp/openmp-specifications