

Dr Michael K Bane, G14, Computer Science, University of Liverpool
m.k. .uk https://cgi.csc.liv.ac.uk/~mkbane/COMP528

COMP528: Multi-core and
Multi-Processor Programming

18 – HAL

Shared vs. Distributed Memory

[Figure: four cores (Core 0 to Core 3), shared-memory vs. distributed-memory layout]

How we start an OpenMP code is a
different approach from how we start
an MPI code.

The way OpenMP uses memory is very
different from the MPI approach.

We cover these points
shortly…

• Thread based

• Shared Memory

• Fork-Join model

• Parallelisation via
WORK-SHARING and
TASKS

• FORTRAN, C, C++

• Directives + environment variables +
run-time library routines

• OpenMP version 4.5 covers:
– parallel regions
– work sharing constructs
– data clauses
– synchronisation
– tasks
– accelerators (sort of!)
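A minimal sketch (not from the lecture) showing these ingredients together: a directive creates the parallel region, run-time library routines query the thread team, and an environment variable chooses the team size.

#include <stdio.h>
#include <omp.h>

int main(void) {
    // directive: fork a team of threads; the block is executed by every thread
    #pragma omp parallel
    {
        // run-time library routines: this thread's id and the team size
        printf("hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }   // join: the team is dissolved
    return 0;
}

Compiled with an OpenMP flag (e.g. -fopenmp for GCC) and typically run with the team size set via the OMP_NUM_THREADS environment variable, e.g. OMP_NUM_THREADS=4.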


History

• 1997: OMP 1

• 2000-2002: OMP 2

• 2005: OMP 2.5 (combined F & C/C++ specs)

• 2008: OMP 3 (tasks)

• 2011: OMP 3.1 (better tasks)

• 2013: OMP 4 (offloading, …) i.e. for GPU, Xeon Phi

• 2015: OMP 4.5 (improved offloading, …)

• 2018: OMP 5.0 ???

Example

• Approximate the integral as the
sum of areas under the curve

• Each area approximated by a
rectangle of height 0.5*[f(x)+f(x+h)]
(i.e. a trapezium)

• Calculate these in parallel…

[Figure: plot of f(x); a single strip between x and x+h is approximated by a rectangle of height 0.5*[f(x)+f(x+h)]]
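In formula form (matching the code below, with h the stepsize and num the number of strips):

integral from a to b of f(x) dx  ≈  sum over i = 0 … num-1 of  (h/2) * [ f(a + i*h) + f(a + (i+1)*h) ],   where h = (b-a)/num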

#include <stdio.h>
#include <math.h>
#include <omp.h>

double func(double);

int main(void) {
  double a=0.0, b=6000.0;
  int num=10000000; // num of traps to sum
  double stepsize=(b-a)/(double) num;
  double x, sum=0.0; // x-space and local summation
  int i;
  double t0, t1; // timers

  t0 = omp_get_wtime();

  #pragma omp parallel for default(none) \
      shared(num, a, stepsize) private(i,x) reduction(+:sum)
  for (i=0; i<num; i++) {
    x = a + i*stepsize;                      // left edge of trapezoid i
    sum += 0.5*(func(x)+func(x+stepsize));   // mean height of trapezoid i
  }
  sum *= stepsize;                           // total area

  t1 = omp_get_wtime();
  printf("integral approx: %f (%f secs)\n", sum, t1-t0);
}

• #pragma omp parallel for => compound “omp parallel for”

• Creates a team of threads (as per omp parallel) and
distributes the iterations of the for loop across the
threads of that team (as per omp for)
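For comparison, a sketch (my own, not the lecture code) of the same loop with the two directives written separately; the compound form is shorthand for this:

  t0 = omp_get_wtime();
  #pragma omp parallel default(none) \
      shared(num, a, stepsize) private(i,x) reduction(+:sum)
  { // fork: team of threads created; this block is the parallel region
    #pragma omp for
    for (i=0; i<num; i++) {                    // iterations shared across the team
      x = a + i*stepsize;
      sum += 0.5*(func(x)+func(x+stepsize));
    }
  } // join: thread team dissolved

Either form needs an OpenMP-aware compile (e.g. -fopenmp with GCC).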

…serial

#pragma omp parallel
{
   … replicated

   #pragma omp for
   for (iters) {
      work(iters)
   }

   … maybe more replicated
   … maybe other work sharing
   … maybe some synchronised
}

… more serial

#pragma omp parallel has to be
followed by a structured block
(e.g. a block in braces, with the opening
brace on a new line)

structured block => the scope of the
parallel region (i.e. what is done in
parallel)


fork: start of the parallel region; a team of threads is created

join: end of the parallel region; the thread team is dissolved

The structured block between fork and join is the scope of the parallel region.


Without work-sharing, commands inside the parallel region are replicated: every thread in the team executes them.
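A minimal sketch (my own, not from the slides) contrasting the two behaviours inside one parallel region:

#include <stdio.h>
#include <omp.h>

int main(void) {
    #pragma omp parallel
    {
        // replicated: every thread in the team prints this line
        printf("replicated on thread %d\n", omp_get_thread_num());

        // work-shared: the 12 iterations are divided among the threads
        #pragma omp for
        for (int i = 0; i < 12; i++) {
            printf("iteration %d done by thread %d\n", i, omp_get_thread_num());
        }
    }
    return 0;
}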


Within a parallel construct we can “work share”
over the current thread team, e.g. with “omp for”.

The scope of an “omp for” is the “for” loop
immediately following the #pragma.

Some conditions on the loop:
• cannot branch out of the loop prematurely
• bounds known at (run-time) entry to the #pragma
(and do not change during execution of the loop)
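To illustrate these conditions (a sketch of my own; f, x, y, n and converged are hypothetical names):

/* fine to work-share: the trip count (n) is known when the #pragma is
   reached, and the body never branches out of the loop                 */
#pragma omp for
for (i = 0; i < n; i++) {
    y[i] = f(x[i]);
}

/* NOT valid under "omp for": the break means the number of iterations
   cannot be known up front, so they cannot be divided among threads    */
for (i = 0; i < n; i++) {
    if (converged(y[i])) break;
    y[i] = f(x[i]);
}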


The “iters” of a work-shared loop are spread over the threads in the
thread team… according to some run-time “schedule”.
E.g. for 12 iterations (0,…,11) and 3 threads (#0, #1, #2), in blocks:

#0: 0, 1, 2, 3    #1: 4, 5, 6, 7    #2: 8, 9, 10, 11

OR cyclically:

#0: 0, 3, 6, 9    #1: 1, 4, 7, 10    #2: 2, 5, 8, 11

Compare with no work-sharing, where commands are replicated and every
thread runs all the iterations:

#0: 0,1,2,3,4,5,6,7,8,9,10,11    #1: 0,1,2,3,4,5,6,7,8,9,10,11    #2: 0,1,2,3,4,5,6,7,8,9,10,11
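In code, the schedule clause chooses between these distributions; a sketch of my own (work(i) is a hypothetical function):

/* block distribution: contiguous chunks, e.g. thread #0 gets 0-3 */
#pragma omp parallel for schedule(static)
for (i = 0; i < 12; i++) { work(i); }

/* cyclic distribution: chunks of one iteration dealt out round-robin,
   e.g. thread #0 gets 0, 3, 6, 9                                      */
#pragma omp parallel for schedule(static, 1)
for (i = 0; i < 12; i++) { work(i); }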

Without work sharing, code goes no faster

• Replication
– Each and every thread does all the work that
would be done if the code region were serial
– But when might replication be useful?

• Work-sharing
– Divides up the work between the threads
– E.g. different threads do different iterations of
the global iteration space of a for loop
– But does this mean that 4 threads will do a
given work-shared region 4x faster?

OpenMP Performance

• Parallel
– But also work sharing (or tasks)

– But also load balancing (more later)

– … and also “granularity” of parallel regions:

1. … input initial (x[i],f[i]) for N particles

2. … determine best fit F=mX+C

3. … calc newF[i] as diff of f[i] from best fit at x[i]

for (i=0; i<N; i++) …
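What “granularity” of parallel regions means here, as a sketch of my own (loop bodies are placeholders; N and newF as in the steps above): loops like steps 2 and 3 can each get their own parallel region, or share one larger region.

/* fine-grained: a separate parallel region (fork + join) per loop */
#pragma omp parallel for
for (i = 0; i < N; i++) { /* e.g. step 3: newF[i] from the best fit */ }

#pragma omp parallel for
for (i = 0; i < N; i++) { /* further per-particle work */ }

/* coarser-grained: one parallel region (one fork/join) containing two
   work-shared loops; an implicit barrier follows each "omp for"        */
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < N; i++) { /* e.g. step 3: newF[i] from the best fit */ }

    #pragma omp for
    for (i = 0; i < N; i++) { /* further per-particle work */ }
}

Fewer, larger parallel regions reduce the fork/join overhead.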