1
A “Hands-on” Introduction to
OpenMP*
Tim Mattson
Principal Engineer
Intel Corporation
timothy.g.mattson@intel.com
* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
Larry Meadows
Principal Engineer
Intel Corporation
lawrence.f.meadows@intel.com
2
Preliminaries: part 1
Disclosures
The views expressed in this tutorial are those of the
people delivering the tutorial.
– We are not speaking for our employers.
– We are not speaking for the OpenMP ARB
This is a new tutorial for us:
Help us improve … tell us how you would make this
tutorial better.
3
Preliminaries: Part 2
Our plan for the day .. Active learning!
We will mix short lectures with short exercises.
You will use your laptop for the exercises … that
way you’ll have an OpenMP environment to take
home so you can keep learning on your own.
Please follow these simple rules
Do the exercises we assign and then change things
around and experiment.
– Embrace active learning!
Don’t cheat: Do Not look at the solutions before
you complete an exercise … even if you get really
frustrated.
4
Our Plan for the day
Topic | Exercise | Concepts
I. OMP Intro | Install sw, hello_world | Parallel regions
II. Creating threads | Pi_spmd_simple | Parallel, default data environment, runtime library calls
III. Synchronization | Pi_spmd_final | False sharing, critical, atomic
IV. Parallel loops | Pi_loop | For, reduction
V. Odds and ends | No exercise | Single, master, runtime libraries, environment variables, synchronization, etc.
VI. Data Environment | Pi_mc | Data environment details, modular software, threadprivate
VII. Worksharing and schedule | Linked list, matmul | For, schedules, sections
VIII. Memory model | Producer consumer | Point to point synch with flush
IX. OpenMP 3 and tasks | Linked list | Tasks and other OpenMP 3 features
(Two breaks and a lunch break are interleaved through the day.)
5
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
6
OpenMP* Overview:
omp_set_lock(lck)
#pragma omp parallel for private(A, B)
#pragma omp critical
C$OMP parallel do shared(a, b, c)
C$OMP PARALLEL REDUCTION (+: A, B)
call OMP_INIT_LOCK (ilok)
call omp_test_lock(jlok)
setenv OMP_SCHEDULE “dynamic”
CALL OMP_SET_NUM_THREADS(10)
C$OMP DO lastprivate(XX)
C$OMP ORDERED
C$OMP SINGLE PRIVATE(X)
C$OMP SECTIONS
C$OMP MASTER
C$OMP ATOMIC
C$OMP FLUSH
C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
C$OMP THREADPRIVATE(/ABC/)
C$OMP PARALLEL COPYIN(/blk/)
Nthrds = OMP_GET_NUM_PROCS()
!$OMP BARRIER
OpenMP: An API for Writing Multithreaded
Applications
A set of compiler directives and library
routines for parallel application programmers
Greatly simplifies writing multi-threaded (MT)
programs in Fortran, C and C++
Standardizes last 20 years of SMP practice
* The name “OpenMP” is the property of the OpenMP Architecture Review Board.
7
OpenMP Basic Defs: Solution Stack
(Solution stack, top to bottom:)
User layer: End User; Application
Prog. layer: Directives/Compiler; OpenMP library; Environment variables
System layer: OpenMP Runtime library; OS/system support for shared memory and threading
HW: Proc1, Proc2, Proc3, … ProcN over a Shared Address Space
8
OpenMP core syntax
Most of the constructs in OpenMP are compiler
directives.
#pragma omp construct [clause [clause]…]
Example
#pragma omp parallel num_threads(4)
Function prototypes and types in the file:
#include <omp.h>
Most OpenMP* constructs apply to a
“structured block”.
Structured block: a block of one or more statements
with one point of entry at the top and one point of
exit at the bottom.
It’s OK to have an exit() within the structured block.
9
Exercise 1, Part A: Hello world
Verify that your environment works
Write a program that prints “hello world”.
void main()
{
int ID = 0;
printf(“ hello(%d) ”, ID);
printf(“ world(%d) \n”, ID);
}
10
Exercise 1, Part B: Hello world
Verify that your OpenMP environment works
Write a multithreaded program that prints “hello world”.
void main()
{
int ID = 0;
printf(“ hello(%d) ”, ID);
printf(“ world(%d) \n”, ID);
}
#pragma omp parallel
{
}
#include “omp.h”
Switches for compiling and linking:
gcc: -fopenmp
pgi: -mp
intel: /Qopenmp
11
Exercise 1: Solution
A multi-threaded “Hello world” program
Write a multithreaded program where each
thread prints “hello world”.
#include “omp.h”
void main()
{
#pragma omp parallel
{
int ID = omp_get_thread_num();
printf(“ hello(%d) ”, ID);
printf(“ world(%d) \n”, ID);
}
}
Sample Output:
hello(1) hello(0) world(1)
world(0)
hello (3) hello(2) world(3)
world(2)
OpenMP include file
Parallel region with default number of threads
Runtime library function to return a thread ID
End of the Parallel region
12
OpenMP Overview:
How do threads interact?
OpenMP is a multi-threading, shared address
model.
– Threads communicate by sharing variables.
Unintended sharing of data causes race
conditions:
– race condition: when the program’s outcome
changes as the threads are scheduled differently.
To control race conditions:
– Use synchronization to protect data conflicts.
Synchronization is expensive so:
– Change how data is accessed to minimize the need
for synchronization.
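To make the term concrete, here is a small sketch (not from the original slides) of a race: several threads increment a shared counter with no synchronization, so increments can be lost and the result varies from run to run. The counter name and loop bound are just for illustration.

#include <omp.h>
#include <stdio.h>

int main()
{
   int count = 0;                      /* shared by all threads */
   #pragma omp parallel num_threads(4)
   {
      for (int i = 0; i < 100000; i++)
         count++;                      /* unsynchronized read-modify-write: a data race */
   }
   /* The result depends on how the threads interleave; it is usually less than 400000. */
   printf("count = %d\n", count);
   return 0;
}

Putting #pragma omp critical (or #pragma omp atomic) immediately before count++ removes the race, at the cost of serializing that one update.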
13
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
14
OpenMP Programming Model:
Fork-Join Parallelism:
Master thread spawns a team of threads as needed.
Parallelism added incrementally until performance goals
are met: i.e. the sequential program evolves into a
parallel program.
(Figure: sequential parts executed by the master thread, shown in red, alternate with parallel regions where a team of threads runs; one parallel region contains a nested parallel region.)
15
Thread Creation: Parallel Regions
You create threads in OpenMP* with the parallel
construct.
For example, To create a 4 thread Parallel region:
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
pooh(ID,A);
}
Each thread calls pooh(ID,A) for ID = 0 to 3.
Each thread executes a copy of the code within the structured block.
omp_set_num_threads() is the runtime function to request a certain number of threads.
omp_get_thread_num() is the runtime function returning a thread ID.
* The name “OpenMP” is the property of the OpenMP Architecture Review Board
16
Thread Creation: Parallel Regions
You create threads in OpenMP* with the parallel
construct.
For example, To create a 4 thread Parallel region:
double A[1000];
#pragma omp parallel num_threads(4)
{
int ID = omp_get_thread_num();
pooh(ID,A);
}
Each thread calls pooh(ID,A) for ID = 0 to 3.
Each thread executes a copy of the code within the structured block.
num_threads(4) is the clause to request a certain number of threads.
omp_get_thread_num() is the runtime function returning a thread ID.
* The name “OpenMP” is the property of the OpenMP Architecture Review Board
17
Thread Creation: Parallel Regions example
Each thread executes the
same code redundantly.
double A[1000];
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
pooh(ID, A);
}
(Execution diagram: double A[1000]; omp_set_num_threads(4); then pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A) run concurrently; then printf("all done\n");)
A single copy of A is shared between all threads.
Threads wait at the end of the parallel region for all threads to finish before proceeding (i.e. a barrier).
* The name “OpenMP” is the property of the OpenMP Architecture Review Board
18
Exercises 2 to 4:
Numerical Integration
Mathematically, we know that:
∫₀¹ 4.0/(1+x²) dx = π
We can approximate the integral as a sum of rectangles:
∑ F(xi)Δx ≈ π   (summed over i = 0 to N)
where each rectangle has width Δx and height F(xi) at the middle of interval i.
(Plot: F(x) = 4.0/(1+x²) for x from 0.0 to 1.0.)
19
Exercises 2 to 4: Serial PI Program
static long num_steps = 100000;
double step;
void main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
20
Exercise 2
Create a parallel version of the pi program
using a parallel construct.
Pay close attention to shared versus private
variables.
In addition to a parallel construct, you will need
the runtime library routines
int omp_get_num_threads();
int omp_get_thread_num();
double omp_get_wtime();
omp_get_num_threads(): number of threads in the team
omp_get_thread_num(): thread ID or rank
omp_get_wtime(): time in seconds since a fixed point in the past
21
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
22
Synchronization
Synchronization is used to impose order constraints and to protect access to shared data.
High level synchronization:
– critical
– atomic
– barrier
– ordered
Low level synchronization (discussed later):
– flush
– locks (both simple and nested)
23
Synchronization: critical
Mutual exclusion: Only one thread at a time
can enter a critical region.
float res;
#pragma omp parallel
{  float B; int i, id, nthrds;
   id = omp_get_thread_num();
   nthrds = omp_get_num_threads();
   for(i=id; i<niters; i+=nthrds){
      B = big_job(i);
      #pragma omp critical
         consume (B, res);
   }
}
Threads wait their turn – only one at a time calls consume().
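For a simple update of a single memory location, the atomic construct listed above under high-level synchronization is a lighter-weight alternative to critical. A minimal sketch, not taken from the original slides; do_work() is a hypothetical stand-in for a real computation:

#include <omp.h>

double do_work(int i) { return i * 0.5; }   /* hypothetical stand-in for real work */

int main()
{
   double sum = 0.0;
   #pragma omp parallel
   {
      int i, id = omp_get_thread_num();
      int nthrds = omp_get_num_threads();
      for (i = id; i < 1000; i += nthrds) {
         double val = do_work(i);
         #pragma omp atomic
         sum += val;               /* atomic protects just this single update */
      }
   }
   return 0;
}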
#include <omp.h>
void main()
{  int num_threads;
   omp_set_dynamic( 0 );
   omp_set_num_threads( omp_get_num_procs() );
   #pragma omp parallel
   {  int id = omp_get_thread_num();
      #pragma omp single
         num_threads = omp_get_num_threads();
      do_lots_of_stuff(id);
   }
}
Protect this op (with single) since memory stores are not atomic.
Request as many threads as you have processors.
Disable dynamic adjustment of the number of threads.
Even in this case, the system may give you fewer threads than requested. If the precise # of threads matters, test for it and respond accordingly.
45
Environment Variables
Set the default number of threads to use.
– OMP_NUM_THREADS int_literal
Control how “omp for schedule(RUNTIME)”
loop iterations are scheduled.
– OMP_SCHEDULE “schedule[, chunk_size]”
… Plus several less commonly used environment variables.
46
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
47
Data environment:
Default storage attributes
Shared Memory programming model:
– Most variables are shared by default
Global variables are SHARED among threads
– Fortran: COMMON blocks, SAVE variables, MODULE
variables
– C: File scope variables, static
– Both: dynamically allocated memory (ALLOCATE, malloc, new)
But not everything is shared…
– Stack variables in subprograms(Fortran) or functions(C) called
from parallel regions are PRIVATE
– Automatic variables within a statement block are PRIVATE.
48
double A[10];
int main() {
int index[10];
#pragma omp parallel
work(index);
printf(“%d\n”, index[0]);
}
extern double A[10];
void work(int *index) {
double temp[10];
static int count;
…
}
Data sharing: Examples
A, index and count are shared by all threads. temp is local to each thread.
* Third party trademarks and names are the property of their respective owner.
49
Data sharing:
Changing storage attributes
One can selectively change storage attributes for
constructs using the following clauses*
– SHARED
– PRIVATE
– FIRSTPRIVATE
The final value of a private inside a parallel loop can be
transmitted to the shared variable outside the loop with:
– LASTPRIVATE
The default attributes can be overridden with:
– DEFAULT (PRIVATE | SHARED | NONE)
All the clauses on this page apply to the OpenMP construct NOT to the entire region.
All data clauses apply to parallel constructs and worksharing constructs except
“shared” which only applies to parallel constructs.
DEFAULT(PRIVATE) is Fortran only
50
Data Sharing: Private Clause
void wrong() {
int tmp = 0;
#pragma omp for private(tmp)
for (int j = 0; j < 1000; ++j)
tmp += j;
printf(“%d\n”, tmp);
}
private(var) creates a new local copy of var for each thread.
– The value is uninitialized
– In OpenMP 2.5 the value of the shared variable is undefined after
the region
tmp was not initialized
tmp: 0 in 3.0, unspecified in 2.5
51
Data Sharing: Private Clause
When is the original variable valid?
int tmp;
void danger() {
tmp = 0;
#pragma omp parallel private(tmp)
work();
printf(“%d\n”, tmp);
}
The original variable’s value is unspecified in OpenMP 2.5.
In OpenMP 3.0, if it is referenced outside of the construct
– Implementations may reference the original variable or a copy …..
A dangerous programming practice!
extern int tmp;
void work() {
tmp = 5;
}
Unspecified which copy of tmp is referenced inside work().
tmp has an unspecified value after the parallel region.
52
Data Sharing: Firstprivate Clause
Firstprivate is a special case of private.
– Initializes each private copy with the corresponding
value from the master thread.
tmp: 0 in 3.0, unspecified in 2.5
void useless() {
int tmp = 0;
#pragma omp for firstprivate(tmp)
for (int j = 0; j < 1000; ++j)
tmp += j;
printf(“%d\n”, tmp);
}
Each thread gets its own tmp with an initial value of 0
53
Data sharing: Lastprivate Clause
Lastprivate passes the value of a private from the
last iteration to a global variable.
tmp is defined as its value at the “last sequential” iteration (i.e., for j=999)
void closer() {
int tmp = 0;
#pragma omp parallel for firstprivate(tmp) \
lastprivate(tmp)
for (int j = 0; j < 1000; ++j)
tmp += j;
printf(“%d\n”, tmp);
}
Each thread gets its own tmp with an initial value of 0
54
Data Sharing:
A data environment test
Consider this example of PRIVATE and FIRSTPRIVATE
Are A,B,C local to each thread or shared inside the parallel region?
What are their initial values inside and values after the parallel region?
Assume variables A, B, and C all equal 1 before the construct:
#pragma omp parallel private(B) firstprivate(C)
Inside this parallel region ...
“A” is shared by all threads; equals 1
“B” and “C” are local to each thread.
– B’s initial value is undefined
– C’s initial value equals 1
Outside this parallel region ...
The values of “B” and “C” are unspecified in OpenMP 2.5, and in
OpenMP 3.0 if referenced in the region but outside the construct.
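A small code rendering of this test (an assumed sketch, not on the original slide), using plain ints:

#include <omp.h>
#include <stdio.h>

int main()
{
   int A = 1, B = 1, C = 1;
   #pragma omp parallel private(B) firstprivate(C)
   {
      /* A is shared and equals 1; B is private and uninitialized here;
         C is private and starts at 1 in every thread. */
      B = A + C;                    /* give B a defined value before using it */
      C += omp_get_thread_num();
   }
   /* After the region: A is unchanged; the values of B and C are
      unspecified in OpenMP 2.5 (see the note above for 3.0). */
   printf("A = %d\n", A);
   return 0;
}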
55
Data Sharing: Default Clause
Note that the default storage attribute is DEFAULT(SHARED) (so
no need to use it)
Exception: #pragma omp task
To change default: DEFAULT(PRIVATE)
each variable in the construct is made private as if specified in a
private clause
mostly saves typing
DEFAULT(NONE): no default for variables in static extent. Must
list storage attribute for each variable in static extent. Good
programming practice!
Only the Fortran API supports default(private).
C/C++ only has default(shared) or default(none).
56
Data Sharing: Default Clause Example
itotal = 1000
C$OMP PARALLEL DEFAULT(PRIVATE) SHARED(itotal)
np = omp_get_num_threads()
each = itotal/np
………
C$OMP END PARALLEL
itotal = 1000
C$OMP PARALLEL PRIVATE(np, each)
np = omp_get_num_threads()
each = itotal/np
………
C$OMP END PARALLEL
These two
code
fragments are
equivalent
57
Data Sharing: tasks (OpenMP 3.0)
The default for tasks is usually firstprivate, because the task may
not be executed until later (and variables may have gone out of
scope).
Variables that are shared in all constructs starting from the
innermost enclosing parallel construct are shared, because the
barrier guarantees task completion.
#pragma omp parallel shared(A) private(B)
{
...
#pragma omp task
{
int C;
compute(A, B, C);
}
}
A is shared
B is firstprivate
C is private
3.0
58
Data sharing: Threadprivate
Makes global data private to a thread
Fortran: COMMON blocks
C: File scope and static variables, static class members
Different from making them PRIVATE
with PRIVATE global variables are masked.
THREADPRIVATE preserves global scope within each
thread
Threadprivate variables can be initialized using
COPYIN or at time of definition (using language-
defined initialization capabilities).
59
A threadprivate example (C)
int counter = 0;
#pragma omp threadprivate(counter)
int increment_counter()
{
counter++;
return (counter);
}
Use threadprivate to create a counter for each
thread.
60
Data Copying: Copyin
parameter (N=1000)
common/buf/A(N)
!$OMP THREADPRIVATE(/buf/)
C Initialize the A array
call init_data(N,A)
!$OMP PARALLEL COPYIN(A)
… Now each thread sees threadprivate array A initialized
… to the global value set in the subroutine init_data()
!$OMP END PARALLEL
end
You initialize threadprivate data using a copyin
clause.
61
Data Copying: Copyprivate
#include <omp.h>
void input_parameters (int, int); // fetch values of input parameters
void do_work(int, int);
void main()
{
int Nsize, choice;
#pragma omp parallel private (Nsize, choice)
{
#pragma omp single copyprivate (Nsize, choice)
input_parameters (Nsize, choice);
do_work(Nsize, choice);
}
}
Used with a single region to broadcast values of privates
from one member of a team to the rest of the team.
62
Exercise 5: Monte Carlo Calculations
Using Random numbers to solve tough problems
Sample a problem domain to estimate areas, compute
probabilities, find optimal values, etc.
Example: Computing π with a digital dart board:
Throw darts at the circle/square.
Chance of falling in circle is
proportional to ratio of areas:
Ac = r² * π
As = (2*r) * (2*r) = 4 * r²
P = Ac/As = π/4
Compute π by randomly choosing points; count the fraction that falls in the circle; compute pi.
(Figure: circle of radius r inscribed in a square of side 2*r.)
N = 10: π ≈ 2.8
N = 100: π ≈ 3.16
N = 1000: π ≈ 3.148
63
Exercise 5
We provide three files for this exercise
pi_mc.c: the monte carlo method pi program
random.c: a simple random number generator
random.h: include file for random number generator
Create a parallel version of this program without
changing the interfaces to functions in random.c
This is an exercise in modular software … why should a user
of your parallel random number generator have to know any
details of the generator or make any changes to how the
generator is called?
Extra Credit:
Make the random number generator threadsafe.
Make your random number generator numerically correct (non-
overlapping sequences of pseudo-random numbers).
64
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
65
Sections worksharing Construct
The Sections worksharing construct gives a
different structured block to each thread.
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
X_calculation();
#pragma omp section
y_calculation();
#pragma omp section
z_calculation();
}
}
By default, there is a barrier at the end of the “omp
sections”. Use the “nowait” clause to turn off the barrier.
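A short sketch of the nowait clause mentioned above (assumed, not on the original slide): each thread that finishes its section proceeds directly to more_work() instead of waiting at the end of the sections construct. The stub functions are placeholders.

#include <stdio.h>
#include <omp.h>

void X_calculation(void) { printf("X\n"); }   /* stubs standing in for real work */
void y_calculation(void) { printf("y\n"); }
void z_calculation(void) { printf("z\n"); }
void more_work(void)     { printf("more work on thread %d\n", omp_get_thread_num()); }

int main()
{
   #pragma omp parallel
   {
      #pragma omp sections nowait
      {
         #pragma omp section
         X_calculation();
         #pragma omp section
         y_calculation();
         #pragma omp section
         z_calculation();
      }
      more_work();   /* reached as soon as a thread finishes its section */
   }
   return 0;
}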
66
loop worksharing constructs:
The schedule clause
The schedule clause affects how loop iterations are
mapped onto threads
schedule(static [,chunk])
– Deal-out blocks of iterations of size “chunk” to each thread.
schedule(dynamic[,chunk])
– Each thread grabs “chunk” iterations off a queue until all
iterations have been handled.
schedule(guided[,chunk])
– Threads dynamically grab blocks of iterations. The size of the
block starts large and shrinks down to size “chunk” as the
calculation proceeds.
schedule(runtime)
– Schedule and chunk size taken from the OMP_SCHEDULE
environment variable (or the runtime library … for OpenMP 3.0).
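A minimal sketch (assumed, not from the slides) of how the clause is written on a loop; work() is a hypothetical placeholder for the real loop body:

#include <omp.h>

double work(int i) { return i * 2.0; }   /* hypothetical per-iteration work */

int main()
{
   double result[1000];
   int i;

   /* hand out chunks of 4 iterations on demand */
   #pragma omp parallel for schedule(dynamic, 4)
   for (i = 0; i < 1000; i++)
      result[i] = work(i);

   /* defer the choice to run time: taken from OMP_SCHEDULE, e.g. "static,10" */
   #pragma omp parallel for schedule(runtime)
   for (i = 0; i < 1000; i++)
      result[i] = work(i);

   return 0;
}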
67
loop work-sharing constructs: The schedule clause

Schedule Clause | When To Use
STATIC | Pre-determined and predictable by the programmer
DYNAMIC | Unpredictable, highly variable work per iteration
GUIDED | Special case of dynamic to reduce scheduling overhead

STATIC: least work at runtime – scheduling done at compile-time.
DYNAMIC and GUIDED: most work at runtime – complex scheduling logic used at run-time.
68
Exercise 6: hard
Consider the program linked.c
Traverses a linked list computing a sequence of
Fibonacci numbers at each node.
Parallelize this program using constructs
defined in OpenMP 2.5 (loop worksharing
constructs).
Once you have a correct program, optimize it.
69
Exercise 6: easy
Parallelize the matrix multiplication program in
the file matmul.c
Can you optimize the program by playing with
how the loops are scheduled?
70
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
71
OpenMP memory model
OpenMP supports a shared memory model.
All threads share an address space, but it can get complicated:
(Figure: processors proc1 … procN, each with a private cache (cache1 … cacheN) that may hold its own copy of a variable a, all connected to a shared memory.)
A memory model is defined in terms of:
Coherence: behavior of the memory system when a single address is accessed by multiple threads.
Consistency: orderings of accesses to different addresses by multiple threads.
72
OpenMP Memory Model: Basic Terms
(Figure: the compiler translates source code, whose reads and writes appear in program order (Wa Wb Ra Rb …), into executable code whose code order may be any semantically equivalent reordering (Wb Rb Wa Ra …); at run time the writes reach memory in a commit order. Each thread has a private (temporary) view of memory and its own threadprivate data, layered over the shared memory holding a and b.)
73
Consistency: Memory Access Re-ordering
Re-ordering:
Compiler re-orders program order to the code order
Machine re-orders code order to the memory
commit order
At a given point in time, the temporary view of
memory may vary from shared memory.
Consistency models based on orderings of
Reads (R), Writes (W) and Synchronizations (S):
R→R, W→W, R→W, R→S, S→S, W→S
74
Consistency
Sequential Consistency:
In a multi-processor, ops (R, W, S) are sequentially
consistent if:
– They remain in program order for each
processor.
– They are seen to be in the same overall order by
each of the other processors.
Program order = code order = commit order
Relaxed consistency:
Remove some of the ordering constraints for
memory ops (R, W, S).
75
OpenMP and Relaxed Consistency
OpenMP defines consistency as a variant of
weak consistency:
S ops must be in sequential order across threads.
Can not reorder S ops with R or W ops on the same
thread
– Weak consistency guarantees
S→W, S→R , R→S, W→S, S→S
The Synchronization operation relevant to this
discussion is flush.
76
Flush
Defines a sequence point at which a thread is
guaranteed to see a consistent view of memory with
respect to the “flush set”.
The flush set is:
“all thread visible variables” for a flush construct without an
argument list.
a list of variables when the “flush(list)” construct is used.
The action of Flush is to guarantee that:
– All R,W ops that overlap the flush set and occur prior to the
flush complete before the flush executes
– All R,W ops that overlap the flush set and occur after the
flush don’t execute until after the flush.
– Flushes with overlapping flush sets can not be reordered.
Memory ops: R = Read, W = write, S = synchronization
77
Synchronization: flush example
Flush forces data to be updated in memory so other
threads see the most recent value
double A;
A = compute();
#pragma omp flush(A)   // flush to memory to make sure other
                       // threads can pick up the right value
Note: OpenMP’s flush is analogous to a fence in
other shared memory API’s.
78
Exercise 7: producer consumer
Parallelize the “prod_cons.c” program.
This is a well known pattern called the
producer consumer pattern
One thread produces values that another thread
consumes.
Often used with a stream of produced values to
implement “pipeline parallelism”
The key is to implement pairwise
synchronization between threads.
79
Exercise 7: prod_cons.c
int main()
{
double *A, sum, runtime; int flag = 0;
A = (double *)malloc(N*sizeof(double));
runtime = omp_get_wtime();
fill_rand(N, A); // Producer: fill an array of data
sum = Sum_array(N, A); // Consumer: sum the array
runtime = omp_get_wtime() - runtime;
printf(” In %lf seconds, The sum is %lf \n”,runtime,sum);
}
I need to put the prod/cons pair inside a loop so it's true pipeline parallelism.
80
What is the Big Deal with Flush?
Compilers routinely reorder instructions implementing
a program
This helps better exploit the functional units, keep machine
busy, hide memory latencies, etc.
Compiler generally cannot move instructions:
past a barrier
past a flush on all variables
But it can move them past a flush with a list of
variables so long as those variables are not accessed
Keeping track of consistency when flushes are used
can be confusing … especially if “flush(list)” is used.
Note: the flush operation does not actually synchronize
different threads. It just ensures that a thread’s values
are made consistent with main memory.
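To make the flush rules concrete, here is a hedged sketch of the pairwise-synchronization idiom these slides describe: a spin-wait on a flag, with flushes ordering the payload and the flag. It illustrates the pattern only; it is not the tutorial's prod_cons.c solution, and the variable names are invented for the example.

#include <omp.h>
#include <stdio.h>

int main()
{
   double data = 0.0;
   int flag = 0;

   #pragma omp parallel num_threads(2) shared(data, flag)
   {
      if (omp_get_thread_num() == 0) {     /* producer */
         data = 42.0;                      /* produce the value           */
         #pragma omp flush(data)           /* make the value visible      */
         flag = 1;
         #pragma omp flush(flag)           /* then make the flag visible  */
      } else {                             /* consumer */
         while (1) {                       /* spin until the flag is set  */
            #pragma omp flush(flag)
            if (flag) break;
         }
         #pragma omp flush(data)           /* pick up the produced value  */
         printf("consumed %f\n", data);
      }
   }
   return 0;
}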
81
Outline
Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks
82
OpenMP pre-history
OpenMP based upon SMP directive
standardization efforts PCF and aborted ANSI
X3H5 – late 80’s
Nobody fully implemented either standard
Only a couple of partial implementations
Vendors considered proprietary API’s to be a
competitive feature:
Every vendor had proprietary directive sets
Even KAP, a “portable” multi-platform parallelization
tool used different directives on each platform
PCF – Parallel computing forum KAP – parallelization tool from KAI.
83
History of OpenMP
(Figure, 1997:
– SGI and Cray merged and needed commonality across products.
– KAI, an ISV, needed a larger market.
– ASCI was tired of recoding for SMPs and urged vendors to standardize; a rough draft straw man SMP API was written.
– DEC, IBM, Intel, HP and other vendors were invited to join.)
1997
84
OpenMP Release History
1997: OpenMP Fortran 1.0
1998: OpenMP C/C++ 1.0
1999: OpenMP Fortran 1.1
2000: OpenMP Fortran 2.0
2002: OpenMP C/C++ 2.0
2005: OpenMP 2.5 – a single specification for Fortran, C and C++
2008: OpenMP 3.0 – tasking, other new features
85
Tasks
Adding tasking is the biggest addition for 3.0
Worked on by a separate subcommittee
led by Jay Hoeflinger at Intel
Re-examined issue from ground up
quite different from Intel taskq’s
3.0
86
General task characteristics
A task has
Code to execute
A data environment (it owns its data)
An assigned thread that executes the code and
uses the data
Two activities: packaging and execution
Each encountering thread packages a new instance
of a task (code and data)
Some thread in the team executes the task at some
later time
3.0
87
Definitions
Task construct – task directive plus structured
block
Task – the package of code and instructions
for allocating data created when a thread
encounters a task construct
Task region – the dynamic sequence of
instructions produced by the execution of a
task by a thread
3.0
88
Tasks and OpenMP
Tasks have been fully integrated into OpenMP
Key concept: OpenMP has always had tasks, we just
never called them that.
Thread encountering parallel construct packages
up a set of implicit tasks, one per thread.
Team of threads is created.
Each thread in team is assigned to one of the tasks
(and tied to it).
Barrier holds original master thread until all implicit
tasks are finished.
We have simply added a way to create a task explicitly
for the team to execute.
Every part of an OpenMP program is part of one task or
another!
3.0
89
task Construct
#pragma omp task [clause[[,]clause] …]
structured-block
where clause can be one of:
if (expression)
untied
shared (list)
private (list)
firstprivate (list)
default( shared | none )
3.0
90
The if clause
When the if clause argument is false
The task is executed immediately by the encountering
thread.
The data environment is still local to the new task…
…and it’s still a different task with respect to
synchronization.
It’s a user directed optimization
when the cost of deferring the task is too great
compared to the cost of executing the task code
to control cache and memory affinity
3.0
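A small illustrative sketch (not from the slides) of this optimization: a recursive Fibonacci computed with tasks, where if(n > CUTOFF) stops deferring tasks once the subproblems get small, so task-creation cost doesn't swamp the work. CUTOFF and the function are made up for the example.

#include <omp.h>
#include <stdio.h>

#define CUTOFF 20

long fib(int n)
{
   long x, y;
   if (n < 2) return n;

   /* below the cutoff the tasks run immediately in the encountering thread */
   #pragma omp task shared(x) if(n > CUTOFF)
   x = fib(n - 1);
   #pragma omp task shared(y) if(n > CUTOFF)
   y = fib(n - 2);
   #pragma omp taskwait
   return x + y;
}

int main()
{
   long result;
   #pragma omp parallel
   {
      #pragma omp single
      result = fib(30);
   }
   printf("fib(30) = %ld\n", result);
   return 0;
}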
91
When/where are tasks complete?
At thread barriers, explicit or implicit
applies to all tasks generated in the current parallel
region up to the barrier
matches user expectation
At task barriers
i.e. Wait until all tasks defined in the current task have
completed.
#pragma omp taskwait
Note: applies only to tasks generated in the current task,
not to “descendants” .
3.0
92
Example – parallel pointer chasing
using tasks
#pragma omp parallel
{
#pragma omp single private(p)
{
p = listhead ;
while (p) {
#pragma omp task
process (p);
p = next (p);
}
}
}
p is firstprivate inside
this task
3.0
93
Example – parallel pointer chasing on
multiple lists using tasks
#pragma omp parallel
{
#pragma omp for private(p)
for ( int i = 0; i < numlists; i++) {
p = listheads[i];
while (p) {
#pragma omp task
process (p);
p = next (p);
}
}
}
94
Example: postorder tree traversal
void postorder(node *p) {
if (p->left)
#pragma omp task
postorder(p->left);
if (p->right)
#pragma omp task
postorder(p->right);
#pragma omp taskwait // wait for descendants
process(p->data);
}
Parent task suspended until children tasks complete
Task scheduling point
3.0
95
Task switching
Certain constructs have task scheduling points
at defined locations within them
When a thread encounters a task scheduling
point, it is allowed to suspend the current task
and execute another (called task switching)
It can then return to the original task and
resume
3.0
96
Task switching example
#pragma omp single
{
for (i=0; i<ONEZILLION; i++)
#pragma omp task
process(item[i]);
}

Exercise 2 solution: Simple SPMD Pi program
#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{ int i, nthreads; double pi, sum[NUM_THREADS];
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds;
double x;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds;
for (i=id, sum[id]=0.0;i< num_steps; i=i+nthrds) {
x = (i+0.5)*step;
sum[id] += 4.0/(1.0+x*x);
}
}
for(i=0, pi=0.0; i<nthreads; i++) pi += sum[i] * step;
}

Exercise 3: SPMD Pi without false sharing
#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{ int nthreads; double pi = 0.0; step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
int i, id,nthrds; double x, sum;
id = omp_get_thread_num();
nthrds = omp_get_num_threads();
if (id == 0) nthreads = nthrds;
for (i=id, sum=0.0; i< num_steps; i=i+nthrds){
x = (i+0.5)*step;
sum += 4.0/(1.0+x*x);
}
#pragma omp critical
pi += sum * step;
}
}
Sum goes “out of scope” beyond the parallel region … so you must sum it in here. Must protect summation into pi in a critical region so updates don’t conflict.
No array, so no false sharing.
Create a scalar local to each thread to accumulate partial sums.
122
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
123
Exercise 4: solution
#include <omp.h>
static long num_steps = 100000; double step;
#define NUM_THREADS 2
void main ()
{ int i; double x, pi, sum = 0.0;
step = 1.0/(double) num_steps;
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel for private(x) reduction(+:sum)
for (i=0;i< num_steps; i++){
x = (i+0.5)*step;
sum = sum + 4.0/(1.0+x*x);
}
pi = step * sum;
}
Note: we created a parallel program without changing any code and by adding 4 simple lines!
i private by default
For good OpenMP implementations, reduction is more scalable than critical.
124
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
125
Computers and random numbers
We use “dice” to make random numbers:
Given previous values, you cannot predict the next value.
There are no patterns in the series … and it goes on forever.
Computers are deterministic machines … set an initial
state, run a sequence of predefined instructions, and
you get a deterministic answer
By design, computers are not random and cannot produce
random numbers.
However, with some very clever programming, we can
make “pseudo random” numbers that are as random as
you need them to be … but only if you are very careful.
Why do I care? Random numbers drive statistical
methods used in countless applications:
Sample a large space of alternatives to find statistically good
answers (Monte Carlo methods).
126
Monte Carlo Calculations:
Using Random numbers to solve tough problems
Sample a problem domain to estimate areas, compute
probabilities, find optimal values, etc.
Example: Computing π with a digital dart board:
Throw darts at the circle/square.
Chance of falling in circle is
proportional to ratio of areas:
Ac = r² * π
As = (2*r) * (2*r) = 4 * r²
P = Ac/As = π/4
Compute π by randomly choosing points; count the fraction that falls in the circle; compute pi.
(Figure: circle of radius r inscribed in a square of side 2*r.)
N = 10: π ≈ 2.8
N = 100: π ≈ 3.16
N = 1000: π ≈ 3.148
127
Parallel Programmers love Monte Carlo
algorithms
#include “omp.h”
static long num_trials = 10000;
int main ()
{
long i; long Ncirc = 0; double pi, x, y;
double r = 1.0; // radius of circle. Side of square is 2*r
seed(0,-r, r); // The circle and square are centered at the origin
#pragma omp parallel for private (x, y) reduction (+:Ncirc)
for(i=0; i<num_trials; i++)
{
x = random();  y = random();
if ( x*x + y*y <= r*r) Ncirc++;
}
pi = 4.0 * ((double)Ncirc/(double)num_trials);
printf("\n %ld trials, pi is %f \n", num_trials, pi);
}
Linked lists without tasks – C version (count the nodes, copy a pointer to each node into an array, process the array in parallel):
while (p != NULL) {
p = p->next;
count++;
}
p = head;
for(i=0; i<count; i++) {
parr[i] = p;
p = p->next;
}
#pragma omp parallel
{
#pragma omp for schedule(static,1)
for(i=0; i<count; i++)
processwork(parr[i]);
}
C++ STL version (using std::vector to hold the node pointers):
std::vector<node *> nodelist;
for (p = head; p != NULL; p = p->next)
nodelist.push_back(p);
int j = (int)nodelist.size();
#pragma omp parallel for schedule(static,1)
for (int i = 0; i < j; ++i)
processwork(nodelist[i]);
Schedule | One Thread | Two Threads
C, (static,1) | 45 seconds | 28 seconds
C++, (static,1) | 49 seconds | 32 seconds
C++, default sched. | 47 seconds | 37 seconds
Copy pointer to each node into an array
Count number of items in the linked list
Process nodes in parallel with a for loop
Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2
143
Appendix: Solutions to exercises
Exercise 1: hello world
Exercise 2: Simple SPMD Pi program
Exercise 3: SPMD Pi without false sharing
Exercise 4: Loop level Pi
Exercise 5: Monte Carlo Pi and random numbers
Exercise 6: hard, linked lists without tasks
Exercise 6: easy, matrix multiplication
Exercise 7: Producer-consumer
Exercise 8: linked lists with tasks
144
Matrix multiplication
#pragma omp parallel for private(tmp, i, j, k)
for (i=0; i<Ndim; i++){
for (j=0; j<Mdim; j++){
tmp = 0.0;
for(k=0; k<Pdim; k++){
/* C(i,j) = sum(over k) A(i,k) * B(k,j) */
tmp += *(A+(i*Ndim+k)) * *(B+(k*Pdim+j));
}
*(C+(i*Ndim+j)) = tmp;
}
}
Approach | One Thread | Two Threads
Array, Static, 1 | 45 seconds | 28 seconds
Intel taskq | 48 seconds | 30 seconds
Results on an Intel dual core 1.83 GHz CPU, Intel IA-32 compiler 10.1 build 2
150
Linked lists with tasks (OpenMP 3)
See the file Linked_intel_taskq.c
#pragma omp parallel
{
#pragma omp single
{
p=head;
while (p) {
#pragma omp task firstprivate(p)
processwork(p);
p = p->next;
}
}
}
Creates a task with
its own copy of “p”
initialized to the
value of “p” when
the task is defined
151
Compiler notes: Intel on Windows
Intel compiler:
Launch SW dev environment … on my laptop I use:
– start/intel software development tools/intel C++
compiler 10.1/C++ build environment for 32 bit
apps
cd to the directory that holds your source code
Build software for program foo.c
– icl /Qopenmp foo.c
Set number of threads environment variable
– set OMP_NUM_THREADS=4
Run your program
– foo.exe
To get rid of the pwd on the prompt, type: prompt = %
152
Compiler notes: Visual Studio
Start “new project”
Select win 32 console project
Set name and path
On the next panel, Click “next” instead of finish so you can
select an empty project on the following panel.
Drag and drop your source file into the source folder on the
visual studio solution explorer
Activate OpenMP
– Go to project properties/configuration
properties/C/C++/language … and activate OpenMP
Set number of threads inside the program
Build the project
Run “without debug” from the debug menu.
153
Notes from the SC08 tutorial
It seemed to go very well. We had over 50 people who stuck it out
throughout the day.
Most people brought their laptops (only 7 loaner laptops were used). And
of those with laptops, most had preloaded an OS.
The chaos at the beginning was much less than I expected. I had fears of
an hour or longer to get everyone setup. But thanks to PGI providing a
license key in a temp file, we were able to get everyone installed in short
order.
Having a team of 4 (two speakers and two assistants) worked well. It
would have been messier without a hardcore compiler person such as
Larry. With dozens of different configurations, he had to do some serious
trouble-shooting to get the most difficult cases up and running.
The exercises used early in the course were good. The ones after lunch
were too hard. I need to refine the exercise set. One idea is to have, for each
slot, an “easy” exercise and a “hard” exercise. This will help me
keep the group’s experience more balanced.
Most people didn’t get the threadprivate exercise. The few people who
tried the linked-list exercise were amazingly creative … one even getting
a single/nowait version to work.