Basic OpenMP
What is OpenMP?
• An open standard for shared-memory programming in C/C++ and Fortran
• supported by IBM, Intel, GNU and others
• Compiler directives and library support
• OpenMP programs are typically still legal to execute sequentially
• allows a program to be incrementally parallelized
• Can be used with MPI — will discuss that later
OpenMP Hardware Model
[Figure: four CPUs, each with its own cache, connected by a bus to shared memory and I/O devices]
A uniform memory access (UMA) shared memory machine is assumed.
Fork/Join Parallelism
• Program execution starts with a single master thread
• The master thread executes the sequential code
• When a parallel part of the program is encountered, a fork spawns or wakes up worker threads
• At the end of the parallel region, a join suspends or kills the worker threads
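A minimal sketch of the fork/join pattern (the array a, its size, and the printed messages are illustrative assumptions, not from the slides):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int a[8];
    printf("sequential part: master thread only\n");

    #pragma omp parallel for        /* fork: worker threads join in */
    for (int i = 0; i < 8; i++) {
        a[i] = i * i;               /* iterations are split across threads */
    }
    /* implicit join/barrier here: back to the master thread */

    printf("sequential again: a[7] = %d\n", a[7]);
    return 0;
}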
Parallel execution using threads
[Figure: timeline of the master thread; at each omp parallel pragma a fork spawns or wakes up worker threads, and a join occurs at the end of the pragma. Green segments are parallel execution, red segments are sequential.]
For efficiency, worker threads are suspended, not killed, at the end of the parallel execution; the next fork reuses the same threads from the last fork.
Where is the work in programs?
• For many programs, most of the work is in loops
• C and Fortran often use loops to express data parallel operations
• the same operation applied to many independent data elements
for (i = first; i < size; i += prime)
    marked[i] = 1;
OpenMP Pragmas
• OpenMP expresses parallelism and other information using pragmas
• A C/C++ or Fortran compiler is free to ignore a pragma -- this means that OpenMP programs have serial as well as parallel semantics
• the outcome of the program should be the same in either case
• pragmas have the form #pragma omp directive-name [clauses]
The parallel for pragma
• OpenMP programmers use the parallel for pragma to tell the compiler a loop is parallel
#pragma omp parallel for
for (i=0; i < n; i++) {
    a[i] = b[i] + c[i];
}
Syntax of the parallel for control clause
for (index = start; index rel-op val; incr)
• start is an integer expression giving the index variable's initial value
• rel-op is one of {<, <=, >=, >}
• val is an integer expression
• incr is one of {index++, ++index, index--, --index, index += val, index -= val, index = index + val, index = val + index, index = index - val}
• OpenMP needs enough information from the loop to run the loop on multiple threads
Each thread has an execution context
• Each thread must be able to access all of the storage it references
• The execution context contains
• static and global variables
• heap-allocated storage
• variables on the stack belonging to functions called along the way to invoking the thread
• a thread-local stack for functions invoked and blocks entered during the thread's execution
• each item in the context is either shared among threads or private to a thread
Example of context
Consider the program below. Variables v1, v2, v3, v4 and v5, as well as the heap-allocated storage, are part of the context.

int v1;
...
main( ) {
    T1 *v2 = malloc(sizeof(T1));
    ...
    f1( );
}
void f1( ) {
    int v3;
    #pragma omp parallel for
    for (int i=0; i < n; i++) {
        int v4;
        T2 *v5 = malloc(sizeof(T2));
    }
}
Context before call to f1
(Same program as above.) Storage, assuming two threads:
• statics and globals: v1 (shared)
• heap: the T1 object (shared)
• global stack: main: v2 (shared)
Context right after call to f1
(Same program as above.) Storage, assuming two threads:
• statics and globals: v1 (shared)
• heap: the T1 object (shared)
• global stack: main: v2, f1: v3 (shared)
Context at start of parallel for
(Same program as above.) Storage, assuming two threads:
• statics and globals: v1 (shared)
• heap: the T1 object (shared)
• global stack: main: v2, f1: v3 (shared)
• t0 stack: loop index i (private to thread 0)
• t1 stack: loop index i (private to thread 1)
Note the private loop index variables.
Context after first iteration of the parallel for
(Same program as above.) Storage, assuming two threads:
• statics and globals: v1 (shared)
• heap: the T1 object and the two T2 objects allocated so far (shared)
• global stack: main: v2, f1: v3 (shared)
• t0 stack: i, v4, v5 (private to thread 0)
• t1 stack: i, v4, v5 (private to thread 1)
Context after parallel for finishes
(Same program as above.) Storage, assuming two threads:
• statics and globals: v1 (shared)
• heap: the T1 object and the two T2 objects (shared)
• global stack: main: v2, f1: v3 (shared)
• the t0 and t1 stacks are gone
A slightly different example -- after each thread has run at least 1 iteration

int v1;
...
main( ) {
    T1 *v2 = malloc(sizeof(T1));
    ...
    f1( );
}
void f1( ) {
    int v3;
    #pragma omp parallel for
    for (i=0; i < n; i++) {
        int v4;
        T2 *v5 = malloc(sizeof(T2));
        v2 = (T1 *) v5;
    }
}

Storage, assuming two threads:
• statics and globals: v1 (shared)
• heap: the T1 object and the T2 objects allocated so far (shared)
• global stack: main: v2, f1: v3 (shared)
• t0 stack: i, v4, v5 (private to thread 0)
• t1 stack: i, v4, v5 (private to thread 1)

v2 points to one of the T2 objects that was allocated. Which one? It depends.
(Same program and storage as above.) v2 points to the T2 allocated by t0 if t0 executes the statement v2 = (T1 *) v5; last.
Likewise, v2 points to the T2 allocated by t1 if t1 executes the statement v2 = (T1 *) v5; last.
The problem with this is that there is a race on the assignment to the v2 variable. Races are evil, to be avoided, never to be done except in the rarest of conditions.
Another problem with this code
(Same program as above.) Storage, assuming two threads:
• statics and globals: v1 (shared)
• heap: the T1 object and a growing pile of T2 objects, one per loop iteration (shared)
• global stack: main: v2, f1: v3 (shared)
The T2 objects allocated in the loop are never freed, so the heap fills with storage that becomes unreachable once the loop's v5 pointers are gone.
Querying the number of
physical processors
• Can query the number of physical processors
• returns the number of cores on a multicore
machine
• returns the number of possible hyperthreads on
a hyperthreaded machine
int omp_get_num_procs(void);
Setting the number of threads
• Number of threads can be more or less than the number of
processors
• if less, some processors or cores will be idle
• if more, more than one thread will execute on a core/
processor
• Operating system and runtime will assign threads to cores
• No guarantee same threads will always run on the same
cores
• Default is number of threads equals number of cores
controlled by the OS image (typically #cores on node/
processor)
void omp_set_num_threads(int t);
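A small sketch combining these calls (the thread count chosen and the printed messages are illustrative assumptions):

#include <stdio.h>
#include <omp.h>

int main(void) {
    int procs = omp_get_num_procs();    /* cores / hyperthreads available */
    printf("processors available: %d\n", procs);

    omp_set_num_threads(procs);         /* request one thread per processor */

    #pragma omp parallel
    {
        #pragma omp master
        printf("team size: %d\n", omp_get_num_threads());
    }
    return 0;
}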
Making more than the parallel for index private
int i, j;
for (i=0; i < n; i++)
    for (j=0; j < n; j++)
        ...
int i, j;
#pragma omp parallel for private(j)
for (i=0; i < n; i++)
    for (j=0; j < n; j++)
        ...
• By default only the parallel for index i is private; other variables written by every thread, such as the inner index j, must be listed in a private clause (see the sketch below)
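A runnable sketch of the same idea, assuming a simple matrix-addition loop body (the arrays and their sizes are illustrative, not from the slides):

#include <omp.h>

#define N 4
#define M 8

int main(void) {
    static double a[N][M], b[N][M], c[N][M];
    int i, j;

    /* j is written by every thread, so it must be private;
       i, as the parallel for index, is private automatically */
    #pragma omp parallel for private(j)
    for (i = 0; i < N; i++) {
        for (j = 0; j < M; j++) {
            a[i][j] = b[i][j] + c[i][j];
        }
    }
    return 0;
}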
• Reductions on commutative operations can
be done in parallel
A parallel reduction
[Figure: the 100-element array a is split into four chunks of 25; each thread sums its chunk into a partial sum]
t[0] = sum of a[0:24]
t[1] = sum of a[25:49]
t[2] = sum of a[50:74]
t[3] = sum of a[75:99]
The partial sums are then combined serially:
tmp = t[0];
for (i = 1; i < 4; i++)
    tmp += t[i];
25 additions in parallel plus about 4 serial steps to combine the partial sums: speedup = 100/29 = 3.45. It takes O(P) steps to sum the partial sums.
A better parallel reduction
[Figure: the array a is again split into four chunks of 25, but the partial sums are combined as a tree]
t[0] = sum of a[0:24]
t[1] = sum of a[25:49]
t[2] = sum of a[50:74]
t[3] = sum of a[75:99]
Then, in parallel: t[0] += t[1] and t[2] += t[3] (1 step)
Finally: t[0] += t[2] (1 step)
25 additions in parallel plus 2 combining steps: speedup = 100/27 = 3.7
How can we do this in OpenMP?
double t[4] = {0.0, 0.0, 0.0, 0.0};
omp_set_num_threads(4);                  /* OpenMP function */
#pragma omp parallel for                 /* parallel part */
for (i=0; i < n; i++) {
    t[omp_get_thread_num( )] += a[i];
}
avg = 0;
for (i=0; i < 4; i++) {                  /* serial part */
    avg += t[i];
}
avg = avg / n;
This is getting messy, and we are still using an O(#threads) summation of the partial sums.
OpenMP provides a
better way
• Reductions are common enough that
OpenMP provides support for them
• reduction clause for omp parallel pragma
• specify variable and operation
• OpenMP takes care of creating
temporaries, computing partial sums, and
computing the final sum
Dot product example
t=0;
for (i=0; i < n; i++) {
t = t + a[i]*c[i];
}
t=0;
#pragma omp parallel for reduction(+:t)
for (i=0; i < n; i++) {
t = t + (a[i]*c[i]);
}
OpenMP makes t private,
puts the partial sums for
each thread into t, and
then forms the full sum of
t as shown earlier
Restrictions on Reductions
Operations on the reduction variable must be of the form
x = x op expr
x = expr op x (except subtraction)
x binop= expr
x++
++x
x--
--x
• x is a scalar variable in the list
• expr is a scalar expression that does not reference x
• op is not overloaded, and is one of +, *, -, /, &, ^, |, &&, ||
• binop is not overloaded, and is one of +, *, -, /, &, ^, |
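An illustrative function showing an update that fits the allowed forms (the array a and the loop are assumptions, not from the slides):

#include <omp.h>

/* t accumulates a sum over a[]; the update fits the form x = x op expr */
double sum_ok(const double *a, int n) {
    double t = 0.0;
    int i;
    #pragma omp parallel for reduction(+:t)
    for (i = 0; i < n; i++) {
        t = t + a[i];   /* t += a[i] or t = a[i] + t would also be allowed */
    }
    return t;
}

/* By contrast, updates such as t = a[i] - t (x = expr - x) or
   t = t + a[i] + t (expr references x) do not fit the allowed forms. */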
Why the restrictions on where t can appear?
#pragma omp parallel for reduction(+:t)
// suppose each element of a[i] == 1
for (i=0; i < n; i++) {
    t = t + (a[i]*c[i]);
}
• During the loop, each thread accumulates into its own private copy of t; the copies are combined into the real t only when the loop finishes
• So reading t inside the loop, anywhere other than in the reduction update itself, would observe just one thread's partial sum rather than the true running total
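A sketch of the kind of code the restriction rules out (the array b and the mid-loop use of t are illustrative assumptions):

#include <omp.h>

void running_sums(const double *a, double *b, int n) {
    double t = 0.0;
    int i;
    #pragma omp parallel for reduction(+:t)
    for (i = 0; i < n; i++) {
        t = t + a[i];   /* fine: the reduction update itself */
        b[i] = t;       /* broken: t here is only this thread's partial sum,
                           not the sum of a[0..i] */
    }
}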
Assigning iterations to
threads (thread scheduling)
• The schedule clause can guide how iterations of a loop
are assigned to threads
• Two kinds of schedules:
• static: iterations are assigned to threads at the start of
the loop. Low overhead but possible load balance
issues.
• dynamic: some iterations are assigned at the start of
the loop, others as the loop progresses. Higher
overheads but better load balance.
• A chunk is a contiguous set of iterations
The schedule clause - static
• schedule(type[, chunk]) where "[ ]" indicates optional
• (type [,chunk]) is
• (static): with no chunk specified, chunks of ~ n/t iterations per thread. The default.
• (static, chunk): chunks of size chunk distributed round-robin across the threads
The schedule clause -
dynamic
• schedule(type[, chunk]) where “[ ]”
indicates optional
• (type [,chunk]) is
• (dynamic): chunks of one iteration distributed dynamically
• (dynamic, chunk): chunks of size chunk
distributed dynamically
Static
Chunk = 1:  thread 0: 0, 3, 6, 9, 12   thread 1: 1, 4, 7, 10, 13   thread 2: 2, 5, 8, 11, 14
Chunk = 2:  thread 0: 0, 1, 6, 7, 12, 13   thread 1: 2, 3, 8, 9, 14, 15   thread 2: 4, 5, 10, 11, 16, 17
With no chunk size specified, the iterations are divided as evenly as possible among processors, with one chunk per processor.
With dynamic, chunks go to processors as work is needed.
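A sketch of how these schedules are requested in code (the loop bodies are illustrative, not from the slides):

#include <omp.h>

void scale_and_shift(double *a, int n) {
    int i;

    /* round-robin blocks of 2 iterations, fixed at the start of the loop */
    #pragma omp parallel for schedule(static, 2)
    for (i = 0; i < n; i++)
        a[i] *= 2.0;

    /* chunks of 4 iterations handed out as threads become free */
    #pragma omp parallel for schedule(dynamic, 4)
    for (i = 0; i < n; i++)
        a[i] += 1.0;
}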
The schedule clause
• schedule(type[, chunk]), where (type [,chunk]) is
• (guided, chunk): uses a guided self-scheduling heuristic. Starts with big chunks and decreases to a minimum chunk size of chunk
• (runtime): the schedule type is taken from the OMP_SCHEDULE environment variable, e.g. setenv OMP_SCHEDULE "static,1"
Guided with two threads example
[Figure: the iteration space divided into blocks 1-9 that shrink in size as the loop progresses]
Dynamic schedule with large blocks
[Figure: the iteration space divided into a few large blocks, 1-9, handed out dynamically to two threads]
Large blocks reduce scheduling costs, but lead to large load imbalance.
Dynamic schedule with small blocks
[Figure: many small blocks; in this run thread 0 happens to get blocks 1, 3, 5, 7, 9, 11, ..., 23, 25, 27 and thread 1 gets blocks 2, 4, 6, 8, 10, 12, ..., 24, 26]
Small blocks have a smaller load imbalance, but higher scheduling costs. We would like the best of both methods.
Guided with two threads
[Figure: blocks 1-9 start large and shrink toward the end of the iteration space]
By starting out with larger blocks, and then ending with small ones, scheduling overhead and load imbalance can both be minimized.
Program Translation for Microtasking Scheme
Subroutine x
...
C$OMP PARALLEL DO
DO j=1,n
   a(j)=b(j)
ENDDO
...
END

is translated by the compiler into

Subroutine x
...
call scheduler(1,n,a,b,loopsub)
...
END

Subroutine loopsub(lb,ub,a,b)
integer lb,ub
DO jj=lb,ub
   a(jj)=b(jj)
ENDDO
END
Parallel Execution Scheme
• Most widely used: the microtasking scheme
[Figure: a main task and a set of helper tasks over time. The main task creates the helpers once; at each parallel loop the helpers are woken up and grab work off of the queue; at the barrier ending each loop the helpers go back to sleep.]
The nowait clause
#pragma omp parallel
{
    #pragma omp for nowait
    for (i=0; i < n; i++) {
        if (a[i] > 0) a[i] += b[i];
    }
    /* without nowait there is an implicit barrier here by default */
    #pragma omp for
    for (i=0; i < n; i++) {
        if (a[i] < 0) a[i] -= b[i];
    }
}
[Figure: execution timelines without and with nowait; with nowait, each thread moves on to the second loop as soon as it finishes its share of the first, instead of waiting at the barrier]
Only the static distribution with the same bounds guarantees the same thread will execute the same iterations from both loops.
The sections pragma
Used to specify task parallelism
#pragma omp parallel sections
{
    #pragma omp section /* optional */
    {
        v = f1( );
        w = f2( );
    }
    #pragma omp section
    v = f3( );
}
[Figure: the first section executes v = f1( ); w = f2( ); while the second section executes v = f3( ) in parallel]
The parallel pragma
#pragma omp parallel private(w)
{
w = getWork (q);
while (w != NULL) {
doWork(w);
w = getWork(q);
}
}
• every thread executes the block following the parallel pragma
• the example gets parallelism of useful work because independent, different work items are pulled off of q
• q needs to be thread safe
The parallel pragma
#pragma omp parallel private(w)
{
#pragma omp critical
w = getWork (q);
while (w != NULL) {
doWork(w);
#pragma omp critical
w = getWork(q);
}
}
• If the data structure pointed to by q is not thread safe, you need to synchronize access to it in your code
• One way is to use the critical directive (see the sketch below)
The single and master directives also exist.
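A self-contained sketch of this work-queue pattern, assuming a simple index-based queue (the queue type and the helper functions getWork and doWork are illustrative, not a real library API):

#include <stdio.h>
#include <omp.h>

typedef struct { int next, total; } queue_t;   /* toy work queue */

/* returns the next work item, or -1 when the queue is empty;
   callers must protect calls with a critical section */
static int getWork(queue_t *q) {
    return (q->next < q->total) ? q->next++ : -1;
}

static void doWork(int w) {
    printf("item %d done by thread %d\n", w, omp_get_thread_num());
}

int main(void) {
    queue_t q = { 0, 16 };
    int w;
    #pragma omp parallel private(w)
    {
        #pragma omp critical
        w = getWork(&q);
        while (w != -1) {
            doWork(w);
            #pragma omp critical
            w = getWork(&q);
        }
    }
    return 0;
}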
The single directive
Requires the statement following the pragma to be executed by exactly one thread of the team (not necessarily the master).
Differs from critical in that critical lets the statement execute on many threads, just one at a time.
#pragma omp parallel private(w)
{
    w = getWork (q);
    while (w != NULL) {
        doWork(w);
        w = getWork(q);
    }
    #pragma omp single
    printf("finishing work");
}
The master directive
Requires statement
following the pragma to be
executed by the master
thread.
Often the master thread is
thread 0, but this is
implementation dependent.
Master thread is the same
thread for the life of the
program.
#pragma omp parallel private(w)
{
    w = getWork (q);
    while (w != NULL) {
        doWork(w);
        w = getWork(q);
    }
    #pragma omp master
    printf("finishing work");
}
Cannot use single/master with for
Each iteration would encounter its own instance of the single, and OpenMP does not allow a single (or master) region to be nested inside the loop of a parallel for.
#pragma omp parallel for
for (i=0; i < n; i++) {
if (a[i] > 0) {
a[i] += b[i];
#pragma omp single
printf(“exiting”);
}
}
Does OpenMP provide a way
to specify:
• what parts of the program execute in parallel with one
another
• how the work is distributed across different cores
• the order that reads and writes to memory will take
place
• that a sequence of accesses to a variable will occur
atomically or without interference from other threads.
• And, ideally, it will do this while giving good performance
and allowing maintainable programs to be written.
What executes in parallel?
c = 57.0;
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}

c = 57.0;
#pragma omp parallel for
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}
• pragma appears like a comment to a non-OpenMP
compiler
• pragma requests parallel code to be produced for
the following for loop
The order that reads and writes to
memory occur
c = 57.0;
#pragma omp parallel for schedule(static)
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}
#pragma omp parallel for schedule(static)
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}
• Within an iteration, access to data appears in-order
• Across iterations, no order is implied. Races lead to undefined
programs
• Across loops, an implicit barrier prevents a loop from starting
execution until all iterations and writes (stores) to memory in the
previous loop are finished
• Parallel constructs execute after preceding sequential constructs finish
Relaxing the order that reads and writes to memory occur
c = 57.0;
#pragma omp parallel
{
    #pragma omp for schedule(static) nowait
    for (i=0; i < n; i++) {
        a[i] = c + a[i]*b[i];
    }
    /* no barrier here because of nowait */
    #pragma omp for schedule(static)
    for (i=0; i < n; i++) {
        a[i] = c + a[i]*b[i];
    }
}
The nowait clause allows a thread to begin executing its part of the code after the nowait loop as soon as it finishes its part of the nowait loop.
Accessing variables without interference
from other threads
#pragma omp parallel for
for (i=0; i < n; i++) {
a = a + b[i];
}
Dangerous -- all iterations are
updating a at the same time --
a race (or data race).
#pragma omp parallel for
for (i=0; i < n; i++) {
#pragma omp critical
a = a + b[i];
}
Stupid but correct -- critical
pragma allows only one thread
to execute the next statement
at a time. Very inefficient -- but
ok if you have enough work to
make it worthwhile.