
Basic OpenMP


What is OpenMP

• An open standard for shared memory programming in C/C++ and Fortran
  • supported by IBM, Intel, GNU and others
• Compiler directives and library support
• OpenMP programs are typically still legal to execute sequentially
• Allows a program to be incrementally parallelized
• Can be used with MPI -- will discuss that later


OpenMP Hardware Model

[Figure: four CPUs, each with its own cache, connected by a bus to a shared memory and I/O devices.]

A uniform memory access (UMA) shared memory machine is assumed.


Fork/Join Parallelism

• Program execution starts with a single master thread
• The master thread executes the sequential code
• When a parallel part of the program is encountered, a fork utilizes other worker threads
• At the end of the parallel region, a join kills or suspends the worker threads


Parallel execution using threads

[Figure: the master thread executes the sequential code (red); at a fork it spawns or wakes up worker threads and execution proceeds in parallel (green); the threads join at the end of the omp parallel pragma; a later fork reuses the same threads from the last fork.]

For efficiency, worker threads are suspended, not killed, at the end of the parallel execution.
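To make the fork/join picture concrete, here is a minimal sketch of a parallel region (the program and its messages are illustrative, not from the slides; it assumes an OpenMP-enabled compiler):

#include <stdio.h>
#include <omp.h>

int main(void) {
    printf("sequential part: master thread only\n");

    #pragma omp parallel          /* fork: worker threads are spawned or woken up */
    {
        printf("parallel part: hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }                             /* join: workers suspend, the master continues alone */

    printf("sequential part again\n");
    return 0;
}

The worker threads are active only between the fork at the start of the parallel region and the join at its end; the printf calls before and after it run on the master thread alone.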


Where is the work in programs?

• For many programs, most of the work is in loops
• C and Fortran often use loops to express data parallel operations
  • the same operation applied to many independent data elements

for (i = first; i < size; i += prime)
    marked[i] = 1;

OpenMP Pragmas

• OpenMP expresses parallelism and other information using pragmas
• A C/C++ or Fortran compiler is free to ignore a pragma -- this means that OpenMP programs have serial as well as parallel semantics
  • the outcome of the program should be the same in either case
• #pragma omp is the general form of a pragma
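As an illustration of the serial semantics, the sketch below (my own, not from the slides) uses the standard _OPENMP macro so the same file builds with or without OpenMP support; with GCC, for example, OpenMP is enabled with -fopenmp and the pragma is ignored otherwise:

#include <stdio.h>
#ifdef _OPENMP               /* defined by OpenMP-aware compilers */
#include <omp.h>
#endif

#define N 8                  /* illustrative size */

int main(void) {
    int i, a[N];

    /* A non-OpenMP compiler ignores this pragma and runs the loop sequentially. */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = i * i;

#ifdef _OPENMP
    printf("built with OpenMP, up to %d threads\n", omp_get_max_threads());
#else
    printf("built without OpenMP, serial execution\n");
#endif
    printf("a[%d] = %d\n", N - 1, a[N - 1]);
    return 0;
}

Either way the array ends up with the same contents, which is the point of the serial semantics.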


pragma for parallel for

• OpenMP programmers use the parallel for pragma to tell the compiler a loop is parallel

#pragma omp parallel for
for (i=0; i < n; i++) {
    a[i] = b[i] + c[i];
}

Syntax of the parallel for control clause

for (index = start; index rel-op val; incr)

• start is an integer index variable
• rel-op is one of {<, <=, >=, >}
• val is an integer expression
• incr is one of {index++, ++index, index--, --index, index += val, index -= val, index = index + val, index = val + index, index = index - val}
• OpenMP needs enough information from the loop to run the loop on multiple threads
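For contrast, here is a small sketch (the function and array names are mine) of a loop that fits this canonical form next to one that parallel for cannot handle:

#include <omp.h>

#define N 1000                  /* illustrative size, not from the slides */

void zero_even_entries(double a[N]) {
    int i;
    /* Conforming loop: start, bound and increment match the form above,
       so the trip count is known before the loop starts. */
    #pragma omp parallel for
    for (i = 0; i < N; i += 2)
        a[i] = 0.0;
}

int find_first_zero(const double a[N]) {
    int i = 0;
    /* Not conforming: a while loop (or any loop whose exit depends on the
       data) gives OpenMP no way to divide the iterations among threads. */
    while (i < N && a[i] != 0.0)
        i++;
    return i;
}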


Each thread has an execution context

• Each thread must be able to access all of the storage it references
• The execution context contains
  • static and global variables (shared)
  • heap allocated storage (shared)
  • variables on the stack belonging to functions called along the way to invoking the thread (shared)
  • a thread-local stack for functions invoked and blocks entered during the thread execution (private)


Example of context

int v1;
...
main( ) {
    T1 *v2 = malloc(sizeof(T1));
    ...
    f1( );
}

void f1( ) {
    int v3;
    #pragma omp parallel for
    for (int i=0; i < n; i++) {
        int v4;
        T2 *v5 = malloc(sizeof(T2));
    }
}

Variables v1, v2, v3 and v4, as well as the heap allocated storage, are part of the context.

The snapshots below show the storage assuming two threads; on the slides red is shared, green is private to thread 0 (t0), and blue is private to thread 1 (t1).

Context before call to f1

shared: statics and globals: v1; heap: the T1 object; global stack: main: v2

Context right after call to f1

shared: statics and globals: v1; heap: the T1 object; global stack: main: v2, f1: v3

Context at start of parallel for

shared: statics and globals: v1; heap: the T1 object; global stack: main: v2, f1: v3
private: t0 stack: index i; t1 stack: index i
Note the private loop index variables.

Context after first iteration of the parallel for

shared: statics and globals: v1; heap: the T1 object plus a T2 object allocated by each thread; global stack: main: v2, f1: v3
private: t0 stack: i, v4, v5; t1 stack: i, v4, v5

Context after parallel for finishes

shared: statics and globals: v1; heap: the T1 object and the T2 objects; global stack: main: v2, f1: v3
The thread-private stacks are gone.

A slightly different example -- after each thread has run at least 1 iteration

Add an assignment to v2 inside the loop:

void f1( ) {
    int v3;
    #pragma omp parallel for
    for (int i=0; i < n; i++) {
        int v4;
        T2 *v5 = malloc(sizeof(T2));
        v2 = (T1 *) v5;
    }
}

v2 points to one of the T2 objects that was allocated. Which one? It depends.
• v2 points to the T2 allocated by t0 if t0 executes the statement v2 = (T1 *) v5; last
• v2 points to the T2 allocated by t1 if t1 executes the statement v2 = (T1 *) v5; last

The problem with this is that there is a race on the assignment to the v2 variable. Races are evil, to be avoided, never to be done except in the rarest of conditions.

Another problem with this code

Every iteration allocates a new T2 object and none of them are ever freed, so the heap fills up with T2 objects -- a memory leak.

Querying the number of physical processors

• Can query the number of physical processors
  • returns the number of cores on a multicore machine
  • returns the number of possible hyperthreads on a hyperthreaded machine

int omp_get_num_procs(void);

Setting the number of threads

• The number of threads can be more or less than the number of processors
  • if less, some processors or cores will be idle
  • if more, more than one thread will execute on a core/processor
• The operating system and runtime will assign threads to cores
  • no guarantee that the same threads will always run on the same cores
• The default is that the number of threads equals the number of cores controlled by the OS image (typically the number of cores on the node/processor)

void omp_set_num_threads(int t);

Making more than the parallel for index private

int i, j;
for (i=0; ...) {
    for (j=0; ...) {
        ...
    }
}

int i, j;
#pragma omp parallel for private(j)
for (i=0; ...) {
    for (j=0; ...) {
        ...
    }
}

By default only the parallel for index i is private; the private(j) clause makes the inner loop index private to each thread as well, so the threads do not race on j.

Reductions

• A reduction combines data of > 0 dimensions into a single scalar value
• Reductions on commutative operations can be done in parallel


A parallel reduction

[Figure: the array a[0] ... a[99] is split into four chunks of 25 elements; each thread computes a partial sum:
t[0] = +a[0:24]    t[1] = +a[25:49]    t[2] = +a[50:74]    t[3] = +a[75:99]]

The partial sums are then combined serially:

tmp = t[0];
for (i = 1; i < 4; i++)
    tmp += t[i];

The partial sums take 25 steps and the serial combine takes 4 more, so the speedup is 100/29 = 3.45. It takes O(P) steps to sum the partial sums.

A better parallel reduction

[Figure: the same four partial sums t[0] = +a[0:24], t[1] = +a[25:49], t[2] = +a[50:74], t[3] = +a[75:99] are combined as a tree: t[0] += t[1] and t[2] += t[3] in parallel, then t[0] += t[2].]

The partial sums take 25 steps and the tree combine takes 1 + 1 steps, so the speedup is 100/27 = 3.7.

How can we do this in OpenMP?

double t[4] = {0.0, 0.0, 0.0, 0.0};
omp_set_num_threads(4);                  /* OpenMP function */
#pragma omp parallel for                 /* parallel */
for (i=0; i < n; i++) {
    t[omp_get_thread_num( )] += a[i];
}
avg = 0;                                 /* serial */
for (i=0; i < 4; i++) {
    avg += t[i];
}
avg = avg / n;

This is getting messy, and we are still using an O(#threads) summation of the partial sums.

OpenMP provides a better way

• Reductions are common enough that OpenMP provides support for them
• reduction clause for the omp parallel pragma
  • specify variable and operation
• OpenMP takes care of creating temporaries, computing partial sums, and computing the final sum

Dot product example

t = 0;
for (i=0; i < n; i++) {
    t = t + a[i]*c[i];
}

t = 0;
#pragma omp parallel for reduction(+:t)
for (i=0; i < n; i++) {
    t = t + (a[i]*c[i]);
}

OpenMP makes t private, puts the partial sums for each thread into t, and then forms the full sum of t as shown earlier.

Restrictions on Reductions

Operations on the reduction variable must be of the form

x = x op expr
x = expr op x   (except subtraction)
x binop= expr
x++
++x
x--
--x

• x is a scalar variable in the list
• expr is a scalar expression that does not reference x
• op is not overloaded, and is one of +, *, -, /, &, ^, |, &&, ||
• binop is not overloaded, and is one of +, *, -, /, &, ^, |

Why the restrictions on where t can appear?

#pragma omp parallel for reduction(+:t)
// each element of a[i] = 1
for (i=0; i < n; i++) {
    t = t + a[i];
    if (t > 1000)        // what value does t have here?
        ...
}

While the loop runs, each thread accumulates into its own private copy of t, so t holds only that thread's partial sum; a test such as t > 1000 inside the loop would not see the value the sequential loop computes.
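Putting the reduction clause and the earlier thread-count routine together, here is a small self-contained version of the dot product (the array size, initialization, and printout are mine, added so the sketch compiles and runs):

#include <stdio.h>
#include <omp.h>

#define N 1000000                         /* illustrative problem size */

int main(void) {
    static double a[N], c[N];
    double t = 0.0;
    int i;

    for (i = 0; i < N; i++) { a[i] = 1.0; c[i] = 2.0; }

    omp_set_num_threads(4);               /* request four threads */

    #pragma omp parallel for reduction(+:t)
    for (i = 0; i < N; i++) {
        t = t + a[i]*c[i];                /* each thread builds a private partial sum */
    }
    /* OpenMP combines the partial sums into t when the loop ends */

    printf("dot product = %f (expected %f)\n", t, 2.0 * N);
    return 0;
}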
Assigning iterations to threads (thread scheduling)

• The schedule clause can guide how iterations of a loop are assigned to threads
• Two kinds of schedules:
  • static: iterations are assigned to threads at the start of the loop. Low overhead but possible load balance issues.
  • dynamic: some iterations are assigned at the start of the loop, others as the loop progresses. Higher overheads but better load balance.
• A chunk is a contiguous set of iterations

The schedule clause - static

• schedule(type[, chunk]) where "[ ]" indicates optional
• (type [,chunk]) is
  • (static): chunks of ~n/t iterations per thread, no chunk specified. The default.
  • (static, chunk): chunks of size chunk distributed round-robin

The schedule clause - dynamic

• schedule(type[, chunk]) where "[ ]" indicates optional
• (type [,chunk]) is
  • (dynamic): chunks of size 1 iteration distributed dynamically
  • (dynamic, chunk): chunks of size chunk distributed dynamically

Static

Chunk = 1:  thread 0: 0, 3, 6, 9, 12    thread 1: 1, 4, 7, 10, 13    thread 2: 2, 5, 8, 11, 14
Chunk = 2:  thread 0: 0, 1, 6, 7, 12, 13    thread 1: 2, 3, 8, 9, 14, 15    thread 2: 4, 5, 10, 11, 16, 17

With no chunk size specified, the iterations are divided as evenly as possible among the processors, with one chunk per processor. With dynamic, chunks go to processors as work is needed.

The schedule clause - guided and runtime

• schedule(type[, chunk]) where (type [,chunk]) is
  • (guided, chunk): uses a guided self-scheduling heuristic. Starts with big chunks and decreases to a minimum chunk size of chunk
  • (runtime): the type depends on the value of the OMP_SCHEDULE environment variable, e.g. setenv OMP_SCHEDULE "static,1"

Guided with two threads example

[Figure: nine chunks of decreasing size handed out alternately to two threads.]

Dynamic schedule with large blocks

[Figure: nine large blocks of iterations distributed across two threads.]
Large blocks reduce scheduling costs, but lead to a large load imbalance.

Dynamic schedule with small blocks

[Figure: thread 0 gets blocks 1, 3, 5, 7, 9, 11, ..., 23, 25, 27, ...; thread 1 gets blocks 2, 4, 6, 8, 10, 12, ..., 24, 26, ...]
Small blocks have a smaller load imbalance, but higher scheduling costs. We would like the best of both methods.

Guided with two threads

[Figure: the same work divided into blocks that start large and shrink toward the end.]
By starting out with larger blocks and then ending with small ones, scheduling overhead and load imbalance can both be minimized.

Program Translation for Microtasking Scheme

Subroutine x
...
C$OMP PARALLEL DO
DO j=1,n
   a(j)=b(j)
ENDDO
...
END

is translated to

Subroutine x
...
call scheduler(1,n,a,b,loopsub)
...
END

Subroutine loopsub(lb,ub,a,b)
integer lb,ub
DO jj=lb,ub
   a(jj)=b(jj)
ENDDO
END

Parallel Execution Scheme

• Most widely used: the microtasking scheme

[Figure: the main task creates the helper tasks once. At each parallel loop the helpers are woken up and grab work off of the queue; at the barrier ending the loop the helpers go back to sleep and the main task continues with the sequential code.]

The nowait clause

#pragma omp parallel for
for (i=0; i < n; i++) {
    if (a[i] > 0) a[i] += b[i];
}
/* barrier here by default */

#pragma omp parallel for nowait
for (i=0; i < n; i++) {
    if (a[i] < 0) a[i] -= b[i];
}

[Figure: per-thread timelines with and without nowait; with nowait a thread starts its iterations of the second loop (j) as soon as it finishes its iterations of the first loop (i), instead of waiting at the barrier.]

Only the static distribution with the same bounds guarantees the same thread will execute the same iterations from both loops.

The sections pragma

Used to specify task parallelism.

#pragma omp parallel sections
{
    #pragma omp section /* optional */
    {
        v = f1( );
        w = f2( );
    }
    #pragma omp section
    v = f3( );
}

[Figure: v = f1( ); w = f2( ) executes in one section while, in parallel, v = f3( ) executes in the other.]

The parallel pragma

#pragma omp parallel private(w)
{
    w = getWork(q);
    while (w != NULL) {
        doWork(w);
        w = getWork(q);
    }
}

• every processor executes the statement following the parallel pragma
• there is parallelism of useful work in the example because independent and different work is pulled off of q
• q needs to be thread safe

The parallel pragma

#pragma omp parallel private(w)
{
    #pragma omp critical
    w = getWork(q);
    while (w != NULL) {
        doWork(w);
        #pragma omp critical
        w = getWork(q);
    }
}

• If the data structure pointed to by q is not thread safe, you need to synchronize it in your code
• One way is to use a critical clause
• single and master clauses also exist

The single directive

Requires the statement following the pragma to be executed by a single thread (not necessarily the master). Differs from critical in that critical lets the statement execute on many threads, just one at a time.

#pragma omp parallel private(w)
{
    w = getWork(q);
    while (w != NULL) {
        doWork(w);
        w = getWork(q);
    }
    #pragma omp single
    fprintf(stderr, "finishing work");
}

The master directive

Requires the statement following the pragma to be executed by the master thread. Often the master thread is thread 0, but this is implementation dependent. The master thread is the same thread for the life of the program.

#pragma omp parallel private(w)
{
    w = getWork(q);
    while (w != NULL) {
        doWork(w);
        w = getWork(q);
    }
    #pragma omp master
    fprintf(stderr, "finishing work");
}

Cannot use single/master with for

There would be many different instances of the single:

#pragma omp parallel for
for (i=0; i < n; i++) {
    if (a[i] > 0) {
        a[i] += b[i];
        #pragma omp single
        printf("exiting");
    }
}
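The legal way to get this effect is to put the single after the loop, inside a parallel region that contains an omp for. A sketch with made-up arrays and sizes:

#include <stdio.h>
#include <omp.h>

#define N 1000                  /* illustrative size */

int main(void) {
    static double a[N], b[N];

    #pragma omp parallel
    {
        #pragma omp for                 /* iterations shared among the team */
        for (int i = 0; i < N; i++)
            if (a[i] > 0) a[i] += b[i];
        /* implicit barrier at the end of the for */

        #pragma omp single              /* executed by exactly one thread */
        printf("exiting\n");
    }
    return 0;
}

The implicit barrier at the end of the for guarantees every update is finished before the single statement runs.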


Does OpenMP provide a way to specify:

• what parts of the program execute in parallel with one another
• how the work is distributed across different cores
• the order that reads and writes to memory will take place
• that a sequence of accesses to a variable will occur atomically or without interference from other threads
• And, ideally, it will do this while giving good performance and allowing maintainable programs to be written.


What executes in parallel?

c = 57.0;
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}

c = 57.0;
#pragma omp parallel for
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}

• the pragma appears like a comment to a non-OpenMP compiler
• the pragma requests parallel code to be produced for the following for loop

The order that reads and writes to memory occur

c = 57.0;
#pragma omp parallel for schedule(static)
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}

#pragma omp parallel for schedule(static)
for (i=0; i < n; i++) {
    a[i] = c + a[i]*b[i];
}

• Within an iteration, accesses to data appear to occur in order
• Across iterations, no order is implied. Races lead to undefined programs
• Across loops, an implicit barrier prevents a loop from starting execution until all iterations, and all writes (stores) to memory, in the previous loop are finished
• Parallel constructs execute after preceding sequential constructs finish

Relaxing the order that reads and writes to memory occur

#pragma omp parallel for schedule(static) nowait
for (i=0; i < n; i++) {
    a[i] = c[i] + a[i]*b[i];
}

#pragma omp parallel for schedule(static)
for (i=0; i < n; i++) {
    a[i] = c[i] + a[i]*b[i];
}

With the nowait clause there is no barrier after the first loop: a thread can begin executing its part of the code after the nowait loop as soon as it finishes its own part of the nowait loop.

Accessing variables without interference from other threads

#pragma omp parallel for
for (i=0; i < n; i++) {
    a = a + b[i];
}

Dangerous -- all iterations are updating a at the same time -- a race (or data race).

#pragma omp parallel for
for (i=0; i < n; i++) {
    #pragma omp critical
    a = a + b[i];
}

Stupid but correct -- the critical pragma allows only one thread to execute the next statement at a time. Very inefficient -- but OK if you have enough work to make it worthwhile.
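The reduction clause shown earlier is the usual efficient way to write this update. A minimal self-contained sketch (the size and contents of b are mine):

#include <stdio.h>
#include <omp.h>

#define N 1000000                 /* illustrative size */

int main(void) {
    static double b[N];
    double a = 0.0;
    int i;

    for (i = 0; i < N; i++) b[i] = 1.0;

    /* Each thread sums into a private copy of a; the copies are combined
       once at the end, so there is no race and no per-iteration locking. */
    #pragma omp parallel for reduction(+:a)
    for (i = 0; i < N; i++)
        a = a + b[i];

    printf("a = %f\n", a);        /* prints 1000000.000000 */
    return 0;
}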