OpenMP 4 – What’s New?

SciNet Developer Seminar

Ramses van Zon

September 25, 2013

Intro to OpenMP

- For shared memory systems.
- Adds parallelism to functioning serial code.
- For C, C++ and Fortran.
- http://openmp.org
- The compiler and run-time do a lot of work for you.
- Divides up the work.
- You tell it how to use variables, and what to parallelize.
- Works by adding compiler directives to code.

Quick Example – C

/* example1.c */
int main()
{
    int i,sum;
    sum=0;
    for (i=0; i<101; i++)
        sum+=i;
    return sum-5050;
}

> $CC example1.c
> ./a.out

/* example1.c */
int main()
{
    int i,sum;
    sum=0;
    #pragma omp parallel
    #pragma omp for reduction(+:sum)
    for (i=0; i<101; i++)
        sum+=i;
    return sum-5050;
}

> $CC example1.c -fopenmp
> export OMP_NUM_THREADS=8
> ./a.out

Quick Example – Fortran

program example1
    integer i,sum
    sum=0
    do i=1,100
        sum=sum+i
    end do
    print *, sum-5050
end program example1

> $FC example1.f90

program example1
    integer i,sum
    sum=0
    !$omp parallel
    !$omp do reduction(+:sum)
    do i=1,100
        sum=sum+i
    end do
    !$omp end parallel
    print *, sum-5050
end program example1

> $FC example1.f90 -fopenmp

Memory Model in OpenMP (3.1)

Execution Model in OpenMP

Execution Model in OpenMP with Tasks

Existing Features (OpenMP 3.1)

1. Create threads with shared and private memory;

2. Parallel sections and loops;

3. Different work scheduling algorithms for load balancing loops;

4. Locks, critical sections and atomic operations to avoid race conditions;

5. Combining results from different threads;

6. Nested parallelism;

7. Generating tasks to be executed by threads.

Supported by GCC, Intel, PGI and IBM XL compilers.

Introducing OpenMP 4.0

- OpenMP 4.0, released in July 2013, is an API specification.
- As usual with standards, it's a mix of features that are commonly implemented in another form and ones that have never been implemented.
- As a result, compiler support varies. E.g., the Intel compilers (v. 14.0) are good at offloading to the Xeon Phi, while gcc has more task support.
- OpenMP 4.0 is a 248-page document (without appendices); the OpenMP 1 specification for C/C++ or Fortran was ≈ 40 pages.
- No examples in this specification, no summary card either.
- But it has a lot of new features...

New Features in OpenMP 4.0

1. Support for compute devices

2. SIMD constructs

3. Task enhancements

4. Thread affinity

5. Other improvements

1. Support for Compute Devices

- Effort to support a wide variety of compute devices: GPUs, Xeon Phis, clusters(?)
- OpenMP 4.0 adds mechanisms to describe regions of code where data and/or computation should be moved to another computing device.
- Moves away from shared memory per se.
- omp target.

Memory Model in OpenMP 4.0

- The device has its own data environment
- and its own shared memory.
- Threads can be bundled into teams of threads.
- The threads of a team can have memory that is shared among the threads of that team.
- Whether this is beneficial depends on the memory architecture of the device. (team ≈ CUDA thread block, MPI_COMM?)

Data mapping

- Host memory and device memory are usually distinct.
- OpenMP 4.0 also allows host and device memory to be shared.
- To accommodate both, the relation between variables on the host and on the device is expressed as a mapping.

Different map types:
- to: existing host variables are copied to the corresponding variables on the target before the target region runs.
- from: target variables are copied back to the corresponding variables on the host after the target region ends.
- tofrom: both to and from.
- alloc: neither to nor from; the variable is guaranteed to exist on the target, but no data is copied to or from the host variable.

Note: arrays and array sections are supported.

OpenMP Device Example using target

/* example2.c */
#include <stdio.h>
#include <omp.h>
int main()
{
    int host_threads, trgt_threads;
    host_threads = omp_get_max_threads();
    #pragma omp target map(from:trgt_threads)
    trgt_threads = omp_get_max_threads();
    printf("host_threads = %d\n", host_threads);
    printf("trgt_threads = %d\n", trgt_threads);
}

> $CC -fopenmp example2.c -o example2
> ./example2
host_threads = 16
trgt_threads = 224

OpenMP Device Example using target

program example2
    use omp_lib
    integer host_threads, trgt_threads
    host_threads = omp_get_max_threads()
    !$omp target map(from:trgt_threads)
    trgt_threads = omp_get_max_threads()
    !$omp end target
    print *, "host_threads =", host_threads
    print *, "trgt_threads =", trgt_threads
end program example2

> $FC -fopenmp example2.f90 -o example2
> ./example2
host_threads = 16
trgt_threads = 224

OpenMP Device Example using teams, distribute
#include
#include <omp.h>
int main()
{
    int ntprocs;
    #pragma omp target map(from:ntprocs)
    ntprocs = omp_get_num_procs();
    int ncases=2240, nteams=4, chunk=ntprocs*2;
    #pragma omp target
    #pragma omp teams num_teams(nteams) thread_limit(ntprocs/nteams)
    #pragma omp distribute
    for (int starti=0; starti

#define N 40

int main()
{
    char haystack[N+1]="abcabcabczabcabcabcxabcabcabczabcabcabcz";
    char needle='x';
    int pos;
    #pragma omp parallel for
    for (int i=0; i

> $CC example.c -fopenmp -o example
> export OMP_NUM_THREADS=16
> export OMP_PLACES=0,8,1,9,2,10,3,11,4,12,5,13,6,14,7,15
> export OMP_PROC_BIND=spread,close
> ./example

5. Other improvements

- User-defined reductions:
  Previously, the OpenMP API only supported reductions with base language operators and intrinsic procedures. With the OpenMP 4.0 API, user-defined reductions are also supported.

    omp declare reduction

- Sequentially consistent atomics:
  A clause has been added that lets the programmer enforce sequential consistency when a specific storage location is accessed atomically.

    omp atomic seq_cst

- Optionally display all internal control variables at program start:

    OMP_DISPLAY_ENV=TRUE|FALSE|VERBOSE

Thank you for your attention.

Have fun exploring!

http://openmp.org/wp/openmp-specifications