Parallel Programming
in C with MPI and OpenMP
Michael J. Quinn
Copyright © The McGraw-Hill Companies, Inc. Permission required for reproduction or display.
Chapter 17
Shared-memory Programming
Outline
OpenMP
Shared-memory model
Parallel for loops
Declaring private variables
Critical sections
Reductions
Performance improvements
More general data parallelism
Functional parallelism
OpenMP
OpenMP: An application programming interface (API) for parallel programming on multiprocessors
Compiler directives
Library of support functions
OpenMP works in conjunction with Fortran, C, or C++
What’s OpenMP Good For?
C + OpenMP sufficient to program multiprocessors
C + MPI + OpenMP a good way to program multicomputers built out of multiprocessors
IBM RS/6000 SP
Fujitsu AP3000
Dell High Performance Computing Cluster
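A hedged sketch of this hybrid style: MPI splits the work across the nodes of a multicomputer, while OpenMP parallelizes the loop within each multiprocessor node. The iteration count and the work inside the loop are illustrative, not from the original slides.

#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[])
{
   int    id, p, i;
   double local = 0.0, total;

   MPI_Init (&argc, &argv);
   MPI_Comm_rank (MPI_COMM_WORLD, &id);
   MPI_Comm_size (MPI_COMM_WORLD, &p);

   /* each MPI process takes an interleaved share of the iterations;
      OpenMP threads then divide that share within the node */
   #pragma omp parallel for reduction(+:local)
   for (i = id; i < 1000000; i += p)
      local += 1.0 / (i + 1.0);

   MPI_Reduce (&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
   if (id == 0) printf ("%f\n", total);
   MPI_Finalize ();
   return 0;
}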
Shared-memory Model
Processors interact and synchronize with each other through shared variables.
[Figure: shared-memory model: multiple processors all connected to a single shared memory]
Fork/Join Parallelism
Initially only master thread is active
Master thread executes sequential code
Fork: Master thread creates or awakens additional threads to execute parallel code
Join: At end of parallel code created threads die or are suspended
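A minimal sketch of this pattern (thread count and output are illustrative):

#include <stdio.h>
#include <omp.h>

int main (void)
{
   printf ("Before: only the master thread is active\n");
   #pragma omp parallel      /* fork: additional threads start here */
   {
      printf ("Hello from thread %d\n", omp_get_thread_num());
   }                         /* join: only the master thread continues */
   printf ("After: only the master thread is active\n");
   return 0;
}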
Fork/Join Parallelism
[Figure: fork/join parallelism over time. The master thread runs alone, forks other threads at the start of parallel code, joins them at its end, and may fork again later]
Shared-memory Model vs.
Message-passing Model (#1)
Shared-memory model
Number of active threads is 1 at start and finish of program, and changes dynamically during execution
Message-passing model
All processes active throughout execution of program
Incremental Parallelization
Sequential program a special case of a shared-memory parallel program
A parallel shared-memory program may have as little as a single parallel loop
Incremental parallelization: process of converting a sequential program to a parallel program a little bit at a time
Shared-memory Model vs.
Message-passing Model (#2)
Shared-memory model
Execute and profile sequential program
Incrementally make it parallel
Stop when further effort not warranted
Message-passing model
Sequential-to-parallel transformation requires major effort
Transformation done in one giant step rather than many tiny steps
Parallel for Loops
C programs often express data-parallel operations as for loops
for (i = first; i < size; i += prime)
marked[i] = 1;
OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel
Compiler takes care of generating code that forks/joins threads and allocates the iterations to threads
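For example, a single pragma marks the loop above for parallel execution (a minimal sketch; marked, first, size, and prime are as in the sequential loop):

#pragma omp parallel for
for (i = first; i < size; i += prime)
   marked[i] = 1;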
Pragmas
Pragma: a compiler directive in C or C++
Stands for “pragmatic information”
A way for the programmer to communicate with the compiler
Compiler free to ignore pragmas
Syntax:
#pragma omp <rest of pragma>
Parallel for Pragma
Format:
#pragma omp parallel for
for (i = 0; i < N; i++)
a[i] = b[i] + c[i];
Compiler must be able to verify the run-time system will have information it needs to schedule loop iterations
Canonical Shape of for Loop Control Clause
The loop must have the form
for (index = start; index <op> end; incr)
where <op> is one of <, <=, >=, or >, and incr is one of:
index++, ++index, index--, --index,
index += inc, index -= inc,
index = index + inc, index = inc + index, index = index - inc
Execution Context
Every thread has its own execution context
Execution context: address space containing all of the variables a thread may access
Contents of execution context:
static variables
dynamically allocated data structures in the heap
variables on the run-time stack
additional run-time stack for functions invoked by the thread
Shared and Private Variables
Shared variable: has same address in execution context of every thread
Private variable: has different address in execution context of every thread
A thread cannot access the private variables of another thread
Shared and Private Variables
[Figure: execution contexts for the code below. Variables b, cptr, and the heap block allocated by malloc are shared; each thread (master thread = thread 0, thread 1, ...) gets its own private copy of the loop index i]
int main (int argc, char *argv[])
{
   int b[3];
   char *cptr;
   int i;

   cptr = malloc(1);
   #pragma omp parallel for
   for (i = 0; i < 3; i++)
      b[i] = i;
   ...
}
Function omp_get_num_procs
Returns number of physical processors available for use by the parallel program
int omp_get_num_procs (void)
Function omp_set_num_threads
Uses the parameter value to set the number of threads to be active in parallel sections of code
May be called at multiple points in a program
void omp_set_num_threads (int t)
Pop Quiz:
Write a C program segment that sets the number of threads equal to the number of processors that are available.
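One possible answer, a minimal sketch using the two functions just introduced:

omp_set_num_threads (omp_get_num_procs());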
Declaring Private Variables
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
for (j = 0; j < n; j++)
a[i][j] = MIN(a[i][j],a[i][k]+tmp);
Either loop could be executed in parallel
We prefer to make outer loop parallel, to reduce number of forks/joins
We then must give each thread its own private copy of variable j
private Clause
Clause: an optional, additional component to a pragma
Private clause: directs compiler to make one or more variables private
private ( <variable list> )
Example Use of private Clause
#pragma omp parallel for private(j)
for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
for (j = 0; j < n; j++)
a[i][j] = MIN(a[i][j],a[i][k]+tmp);
firstprivate Clause
Used to create private variables having initial values identical to the variable controlled by the master thread as the loop is entered
Variables are initialized once per thread, not once per loop iteration
If a thread modifies a variable’s value in an iteration, subsequent iterations will get the modified value
Example Use of firstprivate Clause
x[0] = complex_function();
#pragma omp parallel for firstprivate(x)
for (i = 0; i < n; i++)
   ...
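A hedged sketch of what such a loop might look like (the function g and the loop body are illustrative, not taken from the original slide): every iteration reads x[0], which was assigned before the loop, so each thread's private copy of x must be initialized with the master thread's value.

x[0] = complex_function();
#pragma omp parallel for private(j) firstprivate(x)
for (i = 0; i < n; i++) {
   for (j = 1; j < 4; j++)
      x[j] = g(i, x[j-1]);     /* first pass reads x[0] */
   answer[i] = x[1] - x[3];
}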
Reduction Operators
+ Sum
* Product
& Bitwise and
| Bitwise or
^ Bitwise exclusive or
&& Logical and
|| Logical or
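A brief sketch of how these operators appear in a reduction clause, which names an operator and the variable to be reduced, reduction(<op>:<variable>). The array a and its product here are illustrative:

prod = 1.0;
#pragma omp parallel for reduction(*:prod)
for (i = 0; i < n; i++)
   prod *= a[i];   /* each thread multiplies into a private copy of prod;
                      the copies are combined when the threads join */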
π-finding Code with Reduction Clause
double area, pi, x;
int i, n;
…
area = 0.0;
#pragma omp parallel for \
private(x) reduction(+:area)
for (i = 0; i < n; i++) {
x = (i + 0.5)/n;
area += 4.0/(1.0 + x*x);
}
pi = area / n;
Performance Improvement #1
Too many fork/joins can lower performance
Inverting loops may help performance if
Parallelism is in inner loop
After inversion, the outer loop can be made parallel
Inversion does not significantly lower cache hit rate
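A hedged sketch of loop inversion (the array and the computation are illustrative). In the first form the parallelism is in the inner loop, so threads are forked and joined once per outer iteration; after inversion the parallel loop is outermost and the fork/join happens once.

/* before: parallel inner loop, one fork/join per value of i */
for (i = 1; i < m; i++) {
   #pragma omp parallel for
   for (j = 0; j < n; j++)
      a[i][j] = 2.0 * a[i-1][j];
}

/* after inversion: the independent j loop is outermost */
#pragma omp parallel for private(i)
for (j = 0; j < n; j++)
   for (i = 1; i < m; i++)
      a[i][j] = 2.0 * a[i-1][j];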
Performance Improvement #2
If loop has too few iterations, fork/join overhead is greater than time savings from parallel execution
The if clause instructs compiler to insert code that determines at run-time whether loop should be executed in parallel; e.g.,
#pragma omp parallel for if(n > 5000)
Performance Improvement #3
We can use schedule clause to specify how iterations of a loop should be allocated to threads
Static schedule: all iterations allocated to threads before any iterations executed
Dynamic schedule: only some iterations allocated to threads at beginning of loop’s execution. Remaining iterations allocated to threads that complete their assigned iterations.
Static vs. Dynamic Scheduling
Static scheduling
Low overhead
May exhibit high workload imbalance
Dynamic scheduling
Higher overhead
Can reduce workload imbalance
Chunks
A chunk is a contiguous range of iterations
Increasing chunk size reduces overhead and may increase cache hit rate
Decreasing chunk size allows finer balancing of workloads
schedule Clause
Syntax of schedule clause
schedule (<type>[, <chunk>])
Schedule type required, chunk size optional
Allowable schedule types
static: static allocation
dynamic: dynamic allocation
guided: guided self-scheduling
runtime: type chosen at run-time based on value of environment variable OMP_SCHEDULE
Scheduling Options
schedule(static): block allocation of about n/t contiguous iterations to each thread
schedule(static,C): interleaved allocation of chunks of size C to threads
schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
schedule(dynamic,C): dynamic allocation of C iterations at a time to threads
Scheduling Options (cont.)
schedule(guided): guided self-scheduling with minimum chunk size 1
schedule(guided, C): dynamic allocation of chunks to threads using guided self-scheduling heuristic. Initial chunks are bigger, later chunks are smaller, minimum chunk size is C.
schedule(runtime): schedule chosen at run-time based on value of OMP_SCHEDULE; Unix example:
setenv OMP_SCHEDULE "static,1"
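A minimal sketch of a schedule clause on a loop whose iterations take widely varying amounts of time (the function work is illustrative):

/* chunks of 4 iterations are handed out as threads become free */
#pragma omp parallel for schedule(dynamic, 4)
for (i = 0; i < n; i++)
   result[i] = work(i);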
More General Data Parallelism
Our focus has been on the parallelization of for loops
Other opportunities for data parallelism
processing items on a “to do” list
for loop + additional code outside of loop
Processing a “To Do” List
[Figure: a 'to do' list. The shared variable job_ptr points to the head of the task list on the heap; the master thread and every other thread each have a private task_ptr pointing at the task they are working on]
Sequential Code (1/2)
int main (int argc, char *argv[])
{
struct job_struct *job_ptr;
struct task_struct *task_ptr;
…
task_ptr = get_next_task (&job_ptr);
while (task_ptr != NULL) {
complete_task (task_ptr);
task_ptr = get_next_task (&job_ptr);
}
…
}
Sequential Code (2/2)
struct task_struct *get_next_task (struct job_struct **job_ptr) {
struct task_struct *answer;
if (*job_ptr == NULL) answer = NULL;
else {
answer = (*job_ptr)->task;
*job_ptr = (*job_ptr)->next;
}
return answer;
}
Parallelization Strategy
Every thread should repeatedly take next task from list and complete it, until there are no more tasks
We must ensure no two threads take the same task from the list; i.e., we must declare a critical section
parallel Pragma
The parallel pragma precedes a block of code that should be executed by all of the threads
Note: execution is replicated among all threads
Use of parallel Pragma
#pragma omp parallel private(task_ptr)
{
task_ptr = get_next_task (&job_ptr);
while (task_ptr != NULL) {
complete_task (task_ptr);
task_ptr = get_next_task (&job_ptr);
}
}
Critical Section for get_next_task
struct task_struct *get_next_task (struct job_struct **job_ptr) {
struct task_struct *answer;
#pragma omp critical
{
if (*job_ptr == NULL) answer = NULL;
else {
answer = (*job_ptr)->task;
*job_ptr = (*job_ptr)->next;
}
}
return answer;
}
Functions for SPMD-style Programming
The parallel pragma allows us to write SPMD-style programs
In these programs we often need to know number of threads and thread ID number
OpenMP provides functions to retrieve this information
Function omp_get_thread_num
This function returns the thread identification number
If there are t threads, the ID numbers range from 0 to t-1
The master thread has ID number 0
int omp_get_thread_num (void)
Function omp_get_num_threads
Function omp_get_num_threads returns the number of active threads
If this function is called from a sequential portion of the program, it returns 1
int omp_get_num_threads (void)
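A hedged sketch of SPMD-style code that uses these two functions to divide loop iterations among threads by hand (interleaved allocation; the arrays are illustrative):

#pragma omp parallel private(i)
{
   int t   = omp_get_num_threads();   /* number of active threads   */
   int tid = omp_get_thread_num();    /* this thread's ID, 0 to t-1 */

   /* each thread takes every t-th iteration, starting at its ID */
   for (i = tid; i < n; i += t)
      c[i] = a[i] + b[i];
}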
for Pragma
The parallel pragma instructs every thread to execute all of the code inside the block
If we encounter a for loop that we want to divide among threads, we use the for pragma
#pragma omp for
Example Use of for Pragma
#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
low = a[i];
high = b[i];
if (low > high) {
printf ("Exiting (%d)\n", i);
break;
}
#pragma omp for
for (j = low; j < high; j++)
c[j] = (c[j] - a[i])/b[i];
}
single Pragma
Suppose we only want to see the output once
The single pragma directs compiler that only a single thread should execute the block of code the pragma precedes
Syntax:
#pragma omp single
Use of single Pragma
#pragma omp parallel private(i,j)
for (i = 0; i < m; i++) {
low = a[i];
high = b[i];
if (low > high) {
#pragma omp single
printf ("Exiting (%d)\n", i);
break;
}
#pragma omp for
for (j = low; j < high; j++)
c[j] = (c[j] - a[i])/b[i];
}
nowait Clause
Compiler puts a barrier synchronization at end of every parallel for statement
In our example, this is necessary: if a thread leaves the loop and changes low or high, it may affect the behavior of another thread
If we make these private variables, then it would be okay to let threads move ahead, which could reduce execution time
Use of nowait Clause
#pragma omp parallel private(i,j,low,high)
for (i = 0; i < m; i++) {
low = a[i];
high = b[i];
if (low > high) {
#pragma omp single
printf ("Exiting (%d)\n", i);
break;
}
#pragma omp for nowait
for (j = low; j < high; j++)
c[j] = (c[j] - a[i])/b[i];
}
Functional Parallelism
To this point all of our focus has been on exploiting data parallelism
OpenMP allows us to assign different threads to different portions of code (functional parallelism)
Functional Parallelism Example
v = alpha();
w = beta();
x = gamma(v, w);
y = delta();
printf ("%6.2f\n", epsilon(x,y));
May execute alpha, beta, and delta in parallel
[Figure: data dependence diagram. gamma depends on the results of alpha and beta; epsilon depends on the results of gamma and delta; alpha, beta, and delta have no mutual dependences]
parallel sections Pragma
Precedes a block of k blocks of code that may be executed concurrently by k threads
Syntax:
#pragma omp parallel sections
section Pragma
Precedes each block of code within the encompassing block preceded by the parallel sections pragma
May be omitted for first parallel section after the parallel sections pragma
Syntax:
#pragma omp section
Example of parallel sections
#pragma omp parallel sections
{
#pragma omp section /* Optional */
v = alpha();
#pragma omp section
w = beta();
#pragma omp section
y = delta();
}
x = gamma(v, w);
printf ("%6.2f\n", epsilon(x,y));
Another Approach
Execute alpha and beta in parallel.
Execute gamma and delta in parallel.
[Figure: alternative grouping of the dependence diagram: first alpha with beta, then gamma with delta, then epsilon]
sections Pragma
Appears inside a parallel block of code
Has same meaning as the parallel sections pragma
Putting multiple sections pragmas inside one parallel block may reduce fork/join costs
Use of sections Pragma
#pragma omp parallel
{
#pragma omp sections
{
v = alpha();
#pragma omp section
w = beta();
}
#pragma omp sections
{
x = gamma(v, w);
#pragma omp section
y = delta();
}
}
printf ("%6.2f\n", epsilon(x,y));
Summary (1/3)
OpenMP an API for shared-memory parallel programming
Shared-memory model based on fork/join parallelism
Data parallelism
parallel for pragma
reduction clause
Summary (2/3)
Functional parallelism (parallel sections pragma)
SPMD-style programming (parallel pragma)
Critical sections (critical pragma)
Enhancing performance of parallel for loops
Inverting loops
Conditionally parallelizing loops
Changing loop scheduling
Summary (3/3)
Characteristic                          OpenMP   MPI
Suitable for multiprocessors            Yes      Yes
Suitable for multicomputers             No       Yes
Supports incremental parallelization    Yes      No
Minimal extra code                      Yes      No
Explicit control of memory hierarchy    No       Yes