Introduction
Introduction to OpenMP
(Originally for CS 838, Wisconsin-Madison)
Shuaiwen Leon Song
Slides are derived from online references of Lawrence Livermore National Laboratory, the National Energy Research Scientific Computing Center, the University of Minnesota, and OpenMP.org
Introduction to OpenMP
What is OpenMP?
Open specification for Multi-Processing
“Standard” API for defining multi-threaded shared-memory programs
openmp.org – Talks, examples, forums, etc.
computing.llnl.gov/tutorials/openMP/
portal.xsede.org/online-training
www.nersc.gov/assets/Uploads/XE62011OpenMP.pdf
High-level API
Preprocessor (compiler) directives ( ~ 80% )
Library Calls ( ~ 19% )
Environment Variables ( ~ 1% )
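A minimal sketch of how the three mechanisms fit together (the loop and thread counts here are illustrative; the directive, the omp_get_thread_num() call, and the OMP_NUM_THREADS variable are standard OpenMP):

#include <stdio.h>
#include <omp.h>                 /* library calls live here */

int main() {
    /* Compiler directive: asks the compiler to run the loop in parallel */
    #pragma omp parallel for
    for (int i = 0; i < 8; i++) {
        /* Library call: which thread am I? */
        printf("iteration %d ran on thread %d\n", i, omp_get_thread_num());
    }
    return 0;
}

/* Environment variable, set in the shell rather than in the source:
       export OMP_NUM_THREADS=4                                       */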
A Programmer’s View of OpenMP
OpenMP is a portable, threaded, shared-memory programming specification with “light” syntax
Exact behavior depends on OpenMP implementation!
Requires compiler support (C, C++ or Fortran)
OpenMP will:
Allow a programmer to separate a program into serial regions and parallel regions, rather than reason explicitly about T concurrently-executing threads.
Hide stack management
Provide synchronization constructs
OpenMP will not:
Parallelize automatically
Guarantee speedup
Provide freedom from data races
Outline
Introduction
Motivating example
Parallel Programming is Hard
OpenMP Programming Model
Easier than PThreads
Microbenchmark Performance Comparison
vs. PThreads
Discussion
specOMP
Current Parallel Programming
Start with a parallel algorithm
Implement, keeping in mind:
Data races
Synchronization
Threading Syntax
Test & Debug
Debug
Debug
Motivation – Threading Library
#include <pthread.h>
#include <stdio.h>

void* SayHello(void *foo) {
    printf( "Hello, world!\n" );
    return NULL;
}

int main() {
    pthread_attr_t attr;
    pthread_t threads[16];
    int tn;

    pthread_attr_init(&attr);
    pthread_attr_setscope(&attr, PTHREAD_SCOPE_SYSTEM);

    for(tn=0; tn<16; tn++) {
        pthread_create(&threads[tn], &attr, SayHello, NULL);
    }
    for(tn=0; tn<16; tn++) {
        pthread_join(threads[tn], NULL);
    }
    return 0;
}
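As a usage note (assuming GCC on Linux; the file name is hypothetical), the Pthreads example is built and run with:

gcc hello_pthreads.c -o hello_pthreads -pthread
./hello_pthreads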
Motivation
Thread libraries are hard to use
P-Threads/Solaris threads have many library calls for initialization, synchronization, thread creation, condition variables, etc.
Programmer must code with multiple threads in mind
Synchronization between threads introduces a new dimension of program correctness
Motivation
Wouldn’t it be nice to write serial programs and somehow parallelize them “automatically”?
OpenMP can parallelize many serial programs with relatively few annotations that specify parallelism and independence
OpenMP is a small API that hides cumbersome threading calls with simpler directives
Better Parallel Programming
Start with some algorithm
Embarrassing parallelism is helpful, but not necessary
Implement serially, ignoring:
Data Races
Synchronization
Threading Syntax
Test and Debug
Automatically (magically?) parallelize
Expect linear speedup
Motivation – OpenMP
#include <stdio.h>

int main() {
    // Do this part in parallel
    printf( "Hello, World!\n" );
    return 0;
}
Motivation – OpenMP
#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_num_threads(16);

    // Do this part in parallel
    #pragma omp parallel
    {
        printf( "Hello, World!\n" );
    }
    return 0;
}
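As a usage note (assuming GCC; the file name is hypothetical), the OpenMP version needs only one extra compiler flag, and the thread count could equally be set with the OMP_NUM_THREADS environment variable instead of the omp_set_num_threads() call:

gcc -fopenmp hello_omp.c -o hello_omp
./hello_omp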
OpenMP Parallel Programming
Start with a parallelizable algorithm
Embarrassing parallelism is good, loop-level parallelism is necessary
Implement serially, mostly ignoring:
Data Races
Synchronization
Threading Syntax
Test and Debug
Annotate the code with parallelization (and synchronization) directives
Hope for linear speedup
Test and Debug
LLNL OpenMP Tutorial
There are no better materials than the LLNL OpenMP tutorial.
We will now walk through some of the key points using the tutorial:
https://computing.llnl.gov/tutorials/openMP/
Programming Model
Because OpenMP is designed for shared-memory parallel programming, it is largely limited to single-node parallelism. Typically, the number of processing elements (cores) on a node determines how much parallelism can be exploited.
Motivation for using OpenMP
Programming Model - Threading
Serial regions by default, annotate to create parallel regions
Generic parallel regions
Parallelized loops
Sectioned parallel regions
Thread-like Fork/Join model
Arbitrary number of logical thread creation/destruction events
[Figure: fork/join execution model]
Programming Model - Threading
int main() {
    // serial region
    printf( "Hello..." );

    // parallel region (fork)
    #pragma omp parallel
    {
        printf( "World" );
    }

    // (join) serial again
    printf( "!" );
}

Output with 4 threads: Hello...WorldWorldWorldWorld!
Programming Model – Nested Threading
Fork/Join can be nested
Nesting complications are handled “automagically” by the compiler and runtime
Independent of the number of threads actually running
[Figure: nested fork/join — an outer parallel region forking again inside]
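A minimal sketch of a nested fork/join, assuming an OpenMP 3.0+ runtime (whether the inner region really gets extra threads depends on runtime settings such as the maximum number of active levels):

#include <stdio.h>
#include <omp.h>

int main() {
    omp_set_max_active_levels(2);            /* allow two levels of parallelism */

    #pragma omp parallel num_threads(2)      /* outer fork */
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)  /* inner fork, nested in the outer region */
        {
            printf("outer thread %d, inner thread %d\n",
                   outer, omp_get_thread_num());
        }                                    /* inner join */
    }                                        /* outer join */
    return 0;
}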
Programming Model – Thread Identification
Master Thread
Thread with ID=0
Only thread that exists in sequential regions
Depending on implementation, may have special purpose inside parallel regions
Some special directives affect only the master thread (like master)
[Figure: master thread 0 forks a team of threads 0–7, which join back into thread 0]
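A minimal sketch of thread identification using the standard omp_get_thread_num() call and the master directive mentioned above:

#include <stdio.h>
#include <omp.h>

int main() {
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();      /* 0 for the master thread */

        #pragma omp master
        {
            /* executed only by thread 0; note there is no implied barrier */
            printf("Master thread sees a team of %d threads\n",
                   omp_get_num_threads());
        }

        printf("Hello from thread %d\n", tid);
    }
    return 0;
}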
Programming Model – Data/Control Parallelism
Data parallelism
Threads perform similar functions, guided by thread identifier
Control parallelism
Threads perform differing functions
One thread for I/O, one for computation, etc…
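A small self-contained sketch contrasting the two styles (the array and the work inside each section are illustrative only):

#include <stdio.h>
#include <omp.h>
#define N 1000

int main() {
    double a[N];

    /* Data parallelism: every thread runs the same loop body on different iterations */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 0.5 * i;

    /* Control parallelism: different threads run different code */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("Section 1: e.g. I/O work\n");

        #pragma omp section
        printf("Section 2: e.g. computation, a[N-1] = %f\n", a[N-1]);
    }
    return 0;
}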
Programming Model – Concurrent Loops
OpenMP easily parallelizes loops
Requires: no data dependencies (read/write or write/write pairs) between iterations!
The compiler calculates loop bounds for each thread directly from the serial source
#pragma omp parallel for
for( i=0; i < 25; i++ ) {
    printf( "Foo" );
}

The only change from the serial loop is the directive placed immediately before it.
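To make the dependence requirement concrete, here is a small sketch (the array b and its contents are illustrative, not from the slides):

#include <stdio.h>

int main() {
    int b[25];

    /* Independent iterations: safe to parallelize */
    #pragma omp parallel for
    for (int i = 0; i < 25; i++)
        b[i] = 2 * i;

    /* Loop-carried read/write dependence: iteration i reads b[i-1],
       so this loop must stay serial (no "parallel for" here)        */
    for (int i = 1; i < 25; i++)
        b[i] = b[i-1] + 1;

    printf("b[24] = %d\n", b[24]);
    return 0;
}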
Programming Model – Loop Scheduling
schedule clause determines how loop iterations are divided among the thread team
static([chunk]) divides iterations statically between threads
Each thread receives [chunk] iterations, rounding as necessary to account for all iterations
Default [chunk] is ceil( # iterations / # threads )
dynamic([chunk]) allocates [chunk] iterations per thread, allocating an additional [chunk] iterations when a thread finishes
Forms a logical work queue, consisting of all loop iterations
Default [chunk] is 1
guided([chunk]) allocates dynamically, but [chunk] is exponentially reduced with each allocation
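A minimal sketch of the three schedule clauses applied to the loop on the next slide (doIteration is given a trivial stand-in body here so the example compiles):

#include <stdio.h>

void doIteration(int i) { printf("iteration %d\n", i); }

int main() {
    /* Static: chunks of 4 assigned to threads round-robin, decided up front */
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < 16; i++)
        doIteration(i);

    /* Dynamic: each idle thread takes the next chunk of 2 from a logical work queue */
    #pragma omp parallel for schedule(dynamic, 2)
    for (int i = 0; i < 16; i++)
        doIteration(i);

    /* Guided: dynamic allocation, but chunk sizes start large and shrink */
    #pragma omp parallel for schedule(guided)
    for (int i = 0; i < 16; i++)
        doIteration(i);

    return 0;
}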
Programming Model – Loop Scheduling
for( i=0; i<16; i++ )
{
doIteration(i);
}
// Static scheduling, written out by hand:
// each of the T threads (id tid) gets one contiguous chunk of iterations
int chunk = 16/T;
int base = tid * chunk;
int bound = (tid+1)*chunk;
for( i=base; i<bound; i++ ) {
    doIteration(i);
}