Introduction to OpenMP
Amitava Datta
University of Western Australia
Compiling OpenMP programs
- OpenMP programs written in C are compiled by (for
example): gcc -fopenmp -o prog1 prog1.c
- We have assumed the name of the C file is prog1.c and the
name of the executable is prog1.
- The compiler will look for OpenMP directives in your program
when generating code.
- No action is taken if there are no OpenMP directives in your
program.
pragma directive
If you want the compiler to generate code using OpenMP, you
have to use the pragma directive.
#include <stdio.h>
#include <omp.h>
int main()
{
#pragma omp parallel
  {
    printf("The parallel region is executed by thread %d\n",
           omp_get_thread_num());
  }
}
#pragma omp parallel
- When the compiler encounters the parallel directive, it
generates multi-threaded code.
- How many threads execute the code depends on how many
threads are specified (more later).
- The default is as many threads as there are cores.
- Sample output:
The parallel region is executed by thread 4
The parallel region is executed by thread 3
The parallel region is executed by thread 7
The parallel region is executed by thread 2
The parallel region is executed by thread 5
The parallel region is executed by thread 1
The parallel region is executed by thread 6
The parallel region is executed by thread 0
- But I have only 4 cores in my machine.
Hyperthreading
- Hyperthreading is an Intel technology that treats each
physical core as two logical cores.
- Two threads are executed at the same time (logically) on the
same core.
- Processors (or cores) do not execute instructions in every
clock cycle.
- There is an opportunity to execute another instruction from
another thread when the core is idle.
- Hyperthreading schedules two threads to every core.
- So, my processor has 4 physical cores and 8 logical cores.
Hyperthreading
- The purpose of hyperthreading is to improve throughput
(processing more per unit time).
- This may or may not happen. In fact, hyperthreading may
actually give slower performance.
- Your process may run slower when hyperthreading is turned on.
- It all depends on how well the L1 cache is shared.
- It is possible to turn hyperthreading off through the BIOS
(more on the lab sheet).
Threads run independently
- There is only one thread until the parallel directive is
encountered.
- 7 other threads are launched at that point.
- Thread 0 is usually the master thread (the one that spawns the
other threads).
- The parallel region is enclosed in curly brackets.
- There is an implied barrier at the end of the parallel region.
What is a barrier?
- A barrier is a place in the process that all threads must
reach before further processing occurs.
- Without barriers threads may run ahead of each other, so it is
often necessary to have barriers at different places in a process.
- Barriers are sometimes implicit (like here), and barriers can
sometimes be removed (more later).
- Barriers are expensive in terms of run-time performance. A
typical barrier may take hundreds of clock cycles to ensure
that all threads have reached the barrier.
- It is better to remove barriers, but this is fraught with danger.
A variation of our code
#include <stdio.h>
#include <unistd.h>
#include <omp.h>
int main()
{
#pragma omp parallel
  {
    if (omp_get_thread_num()==3) sleep(1);
    printf("The parallel region is executed by thread %d\n",
           omp_get_thread_num());
  }
}
Output
The parallel region is executed by thread 4
The parallel region is executed by thread 7
The parallel region is executed by thread 1
The parallel region is executed by thread 2
The parallel region is executed by thread 5
The parallel region is executed by thread 6
The parallel region is executed by thread 0
The parallel region is executed by thread 3
- Thread 3 is now suspended for 1 second, so all other threads
complete before thread 3.
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
OpenMP* Overview:
A sampler of OpenMP constructs, in C and Fortran:
omp_set_lock(lck)
#pragma omp parallel for private(A, B)
#pragma omp critical
C$OMP parallel do shared(a, b, c)
C$OMP PARALLEL REDUCTION (+: A, B)
call OMP_INIT_LOCK (ilok)
call omp_test_lock(jlok)
setenv OMP_SCHEDULE "dynamic"
CALL OMP_SET_NUM_THREADS(10)
C$OMP DO lastprivate(XX)
C$OMP ORDERED
C$OMP SINGLE PRIVATE(X)
C$OMP SECTIONS
C$OMP MASTER
C$OMP ATOMIC
C$OMP FLUSH
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP THREADPRIVATE(/ABC/)
C$OMP PARALLEL COPYIN(/blk/)
Nthrds = OMP_GET_NUM_PROCS()
!$OMP BARRIER
OpenMP: An API for Writing Multithreaded
Applications
A set of compiler directives and library
routines for parallel application programmers
Greatly simplifies writing multi-threaded (MT)
programs in Fortran, C and C++
Standardizes last 20 years of SMP practice
* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
OpenMP Basic Defs: Solution Stack
[Figure: the OpenMP solution stack]
User layer: End User; Application
Prog. layer: Directives, Compiler; OpenMP library; Environment variables
System layer: OpenMP Runtime library; OS/system support for shared
memory and threading
HW layer: Shared Address Space; Proc1, Proc2, Proc3, ... ProcN
OpenMP core syntax
Most of the constructs in OpenMP are compiler
directives.
#pragma omp construct [clause [clause]...]
Example
#pragma omp parallel num_threads(4)
Function prototypes and types in the file:
#include <omp.h>
Most OpenMP* constructs apply to a
"structured block".
Structured block: a block of one or more statements
with one point of entry at the top and one point of
exit at the bottom.
It's OK to have an exit() within the structured block.
Exercise 1, Part A: Hello world
Verify that your environment works
Write a program that prints “hello world”.
#include <stdio.h>
int main()
{
  int ID = 0;
  printf(" hello(%d) ", ID);
  printf(" world(%d) \n", ID);
}
Exercise 1, Part B: Hello world
Verify that your OpenMP environment works
Write a multithreaded program that prints “hello world”.
Start from the serial version:
#include <stdio.h>
int main()
{
  int ID = 0;
  printf(" hello(%d) ", ID);
  printf(" world(%d) \n", ID);
}
Hints: add
#include <omp.h>
and enclose the body in a parallel region:
#pragma omp parallel
{
}
Switches for compiling and linking:
gcc: -fopenmp
pgi: -mp
intel: /Qopenmp
Exercise 1: Solution
A multi-threaded “Hello world” program
Write a multithreaded program where each
thread prints “hello world”.
#include <omp.h>                  // OpenMP include file
#include <stdio.h>
int main()
{
#pragma omp parallel              // parallel region with default
  {                               // number of threads
    int ID = omp_get_thread_num();  // runtime library function to
                                    // return a thread ID
    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
  }                               // end of the parallel region
}
Sample Output:
hello(1) hello(0) world(1)
world(0)
hello(3) hello(2) world(3)
world(2)
OpenMP Overview:
How do threads interact?
OpenMP is a multi-threading, shared address
model.
– Threads communicate by sharing variables.
Unintended sharing of data causes race
conditions:
– race condition: when the program’s outcome
changes as the threads are scheduled differently.
To control race conditions:
– Use synchronization to protect data conflicts.
Synchronization is expensive so:
– Change how data is accessed to minimize the need
for synchronization.
OpenMP Programming Model:
Fork-Join Parallelism:
Master thread spawns a team of threads as needed.
Parallelism added incrementally until performance goals
are met: i.e. the sequential program evolves into a
parallel program.
[Figure: fork-join execution. Sequential parts run on the master
thread (shown in red); at each parallel region the master forks a
team of threads, including a nested parallel region, and the team
joins back into the master thread at the end of the region.]
Thread Creation: Parallel Regions
You create threads in OpenMP* with the parallel
construct.
For example, to create a 4-thread parallel region:
double A[1000];
omp_set_num_threads(4);   // runtime function to request a
                          // certain number of threads
#pragma omp parallel
{
  int ID = omp_get_thread_num();  // runtime function returning
                                  // a thread ID
  pooh(ID,A);
}
Each thread calls pooh(ID,A) for ID = 0 to 3.
Each thread executes a copy of the code within the structured block.
Thread Creation: Parallel Regions
You create threads in OpenMP* with the parallel
construct.
For example, to create a 4-thread parallel region:
double A[1000];
#pragma omp parallel num_threads(4)   // clause to request a
                                      // certain number of threads
{
  int ID = omp_get_thread_num();  // runtime function returning
                                  // a thread ID
  pooh(ID,A);
}
Each thread calls pooh(ID,A) for ID = 0 to 3.
Each thread executes a copy of the code within the structured block.
Thread Creation: Parallel Regions example
Each thread executes the
same code redundantly.
double A[1000];           // a single copy of A is shared
omp_set_num_threads(4);   // between all threads
#pragma omp parallel
{
  int ID = omp_get_thread_num();
  pooh(ID, A);
}
printf("all done\n");     // threads wait here for all threads to
                          // finish before proceeding (i.e. a barrier)
[Figure: the master thread executes omp_set_num_threads(4) and
forks; pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A) run in
parallel; all threads join before printf("all done\n") executes.]
SPMD vs. worksharing
A parallel construct by itself creates an SPMD
or “Single Program Multiple Data” program …
i.e., each thread redundantly executes the
same code.
How do you split up pathways through the
code between threads within a team?
This is called worksharing
– Loop construct
– Sections/section constructs
– Single construct
– Task construct …. Coming in OpenMP 3.0
The loop worksharing Constructs
The loop worksharing construct splits up loop
iterations among the threads in a team.
#pragma omp parallel
{
#pragma omp for
  for (I=0;I<N;I++){
    NEAT_STUFF(I);
  }
}
#include <omp.h>
void input_parameters (int, int); // fetch values of input parameters
void do_work(int, int);
void main()
{
  int Nsize, choice;
#pragma omp parallel private (Nsize, choice)
  {
#pragma omp single copyprivate (Nsize, choice)
    input_parameters (Nsize, choice);
    do_work(Nsize, choice);
  }
}
Used with a single region to broadcast values of privates
from one member of a team to the rest of the team.
Synchronization: Barrier
Barrier: Each thread waits until all threads arrive.
#pragma omp parallel shared (A, B, C) private(id)
{
  id = omp_get_thread_num();
  A[id] = big_calc1(id);
#pragma omp barrier
#pragma omp for
  for (i=0;i<N;i++) { C[i] = big_calc3(i,A); }
}
void postorder(node *p) {
  if (p->left)
#pragma omp task
    postorder(p->left);
  if (p->right)
#pragma omp task
    postorder(p->right);
#pragma omp taskwait // wait for descendants
  process(p->data);
}
Parent task suspended until children tasks complete. The taskwait
is a task scheduling point. (OpenMP 3.0)
Task switching
Certain constructs have task scheduling points
at defined locations within them.
When a thread encounters a task scheduling
point, it is allowed to suspend the current task
and execute another (called task switching).
It can then return to the original task and
resume. (OpenMP 3.0)
Task switching example
#pragma omp single
{
for (i=0; i