Microsoft PowerPoint – COMP528 HAL18 OpenMP.pptx
Dr Michael K Bane, G14, Computer Science, University of Liverpool
https://cgi.csc.liv.ac.uk/~mkbane/COMP528
COMP528: Multi-core and
Multi-Processor Programming
18 – HAL
Shared vs. Distributed Memory
[Diagram: Core 0, Core 1, Core 2, Core 3]
How we start our OpenMP code is a
different approach from starting MPI.
OpenMP's use of memory is very
different from the MPI approach.
We cover these points
shortly…
• Thread based
• Shared Memory
• Fork-Join model
• Parallelisation via
WORK-SHARING and
TASKS
• FORTRAN, C, C++
• Directives + Environment variables +
Run-time library routines
• OpenMP version 4.5
parallel regions
work sharing constructs
data clauses
synchronisation
tasks
accelerators (sort of!)
History
• 1997: OMP 1
• 2000-2002: OMP 2
• 2005: OMP 2.5 (combined Fortran & C/C++ specs)
• 2008: OMP 3 (tasks)
• 2011: OMP 3.1 (better tasks)
• 2013: OMP 4 (offloading, …) ie for GPU, XeonPhi
• 2015: OMP 4.5 (improved offloading, …)
• 2018: OMP 5.0 ???
Example
• Approximate integral is
sum of areas under line
• Each area approximated
by a trapezoid
• Calculate these in parallel…
[Plot of f(x) for 0 ≤ x ≤ 3.5, divided into strips; the strip from x to x+h has average height 0.5*[f(x)+f(x+h)]]
#include <stdio.h>
#include <math.h>
#include <omp.h>   // for omp_get_wtime(); includes reconstructed

double func(double);   // integrand, defined elsewhere

int main(void) {
  double a=0.0, b=6000.0;
  int num=10000000;               // num of traps to sum
  double stepsize=(b-a)/(double) num;
  double x, sum=0.0;              // x-space and local summation
  int i;
  double t0, t1;                  // timers

  t0 = omp_get_wtime();
  #pragma omp parallel for default(none) \
  shared(num, a, stepsize) private(i,x) reduction(+:sum)
  for (i=0; i<num; i++) {
    x = a + (double)i * stepsize;
    sum += 0.5*stepsize*(func(x)+func(x+stepsize));  // trapezoid area
  }
  t1 = omp_get_wtime();
  printf("integral=%f in %f seconds\n", sum, t1-t0);
  return 0;
}
• Creates a team of threads (as per omp parallel) &
distributes the iterations of the for loop across the
threads of that team (as per omp for)
…serial
#pragma omp parallel
{
… replicated
#pragma omp for
for (iters) {
work(iters)
}
… maybe more replicated
… maybe other work sharing
… maybe some synchronised
}
… more serial
#pragma omp parallel has to be
followed by a structured block
(e.g. a set of braces, with the opening
brace on a new line)
structured block => scope of the
parallel region (i.e. what is done in
parallel)
fork: set up parallel region; a team of threads is created
join: end of parallel region; thread team dissolved
The structured block between fork and join is the scope of the parallel region
without work-sharing, commands are replicated
within a parallel construct we can “work share”
over the current thread team eg “omp for”
The scope of an “omp for” is the “for” loop
immediately following the #pragma.
Some conditions on the loop:
• cannot branch out of loop prematurely
• bounds known at (run time) entry to the #pragma
(& do not change during execution of loop)
The “iters” are spread over the threads in the thread
team… according to some run-time “schedule”
E.g. for 12 iters (0,…,11) and 3 threads (#0, #1, #2):

#0: 0, 1, 2, 3    #1: 4, 5, 6, 7    #2: 8, 9, 10, 11

OR

#0: 0, 3, 6, 9    #1: 1, 4, 7, 10    #2: 2, 5, 8, 11

Without work-sharing, commands are replicated: every thread does all 12 iterations

#0: 0,1,2,3,4,5,6,7,8,9,10,11    #1: 0,1,2,3,4,5,6,7,8,9,10,11    #2: 0,1,2,3,4,5,6,7,8,9,10,11
Without work sharing, code goes no faster
• Replication
– Each & every thread is doing all
the work that would be done if
code region was serial
– But when might this be useful?
• Work-sharing
– Divides up work between threads
– Eg different threads do different
iterations of the global iteration
space of a for loop
– But does this mean that 4 threads
will do a given work-sharing 4x
faster?
OpenMP Performance
• Parallel
– But also work sharing (or tasks)
– But also load balancing (more later)
– … and also “granularity” of parallel regions:
1. … input initial (x[i],f[i]) for N particles
2. … determine best fit F=mX+C
3. … calc newF[i] as diff of f[i] from best fit at x[i]
for (i=0; i