
Introduction to OpenMP

Amitava Datta

University of Western Australia

Compiling OpenMP programs

- OpenMP programs written in C are compiled with (for
  example): gcc -fopenmp -o prog1 prog1.c

- Here we assume the name of the C file is prog1.c and the
  name of the executable is prog1.

- The compiler looks for OpenMP directives in your program
  and generates multi-threaded code from them.

- No action is taken if there are no OpenMP directives in your
  program. (You can test this with the _OPENMP macro, sketched
  below.)
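Whether OpenMP was enabled at compile time can be checked from inside
the program: compilers that honour -fopenmp define the _OPENMP macro.
A minimal sketch (the printed messages are illustrative only):

#include <stdio.h>

int main()
{
    /* _OPENMP is defined by the compiler only when OpenMP is enabled
       (e.g. gcc -fopenmp); otherwise the serial branch is compiled. */
#ifdef _OPENMP
    printf("Compiled with OpenMP support\n");
#else
    printf("Compiled as a plain serial program\n");
#endif
    return 0;
}

Compiling the same file with and without -fopenmp and running both
executables makes the difference visible.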

pragma directive

If you want the compiler to generate code using OpenMP, you
have to use the pragma directive

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        printf("The parallel region is executed by thread %d\n",
               omp_get_thread_num());
    }
}

#pragma omp parallel

- When the compiler encounters the parallel directive, it
  generates multi-threaded code.

- How many threads execute the code depends on how many
  threads are specified (more later).

- By default, the number of threads equals the number of cores.

Sample output:

The parallel region is executed by thread 4
The parallel region is executed by thread 3
The parallel region is executed by thread 7
The parallel region is executed by thread 2
The parallel region is executed by thread 5
The parallel region is executed by thread 1
The parallel region is executed by thread 6
The parallel region is executed by thread 0

- But I have only 4 cores in my machine! (A sketch of querying
  and setting the team size follows.)
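The team size can be queried and controlled from the program. A
minimal sketch using the runtime functions omp_set_num_threads,
omp_get_num_threads and omp_get_thread_num (the request of 4 threads
is just for illustration):

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_num_threads(4);   /* request 4 threads for later parallel regions */

    #pragma omp parallel
    {
        /* omp_get_num_threads() reports the team size inside the region */
        if (omp_get_thread_num() == 0)
            printf("Team size: %d\n", omp_get_num_threads());
    }
}

The same request can be made without recompiling by setting the
OMP_NUM_THREADS environment variable before running the program.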

Hyperthreading

- Hyperthreading is an Intel technology that treats each
  physical core as two logical cores.

- Two threads are executed at the same time (logically) on the
  same core.

- Processors (or cores) do not execute instructions in every
  clock cycle.

- When a core would otherwise be idle, there is an opportunity
  to execute an instruction from another thread.

- Hyperthreading schedules two threads to every core.

- So, my processor has 4 physical cores and 8 logical cores.

Hyperthreading

- The purpose of hyperthreading is to improve throughput
  (processing more work per unit time).

- This may or may not happen; in fact, hyperthreading may
  actually hurt performance.

- An individual process may run slower when hyperthreading is
  turned on.

- Much depends on how well the threads share the L1 cache.

- It is possible to turn hyperthreading off through the BIOS
  (more on the lab sheet). The sketch below shows how to query
  the logical core count.
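The logical core count that OpenMP sees can be inspected directly. A
minimal sketch using the standard runtime calls omp_get_num_procs and
omp_get_max_threads:

#include <stdio.h>
#include <omp.h>

int main()
{
    /* on a 4-core hyperthreaded machine this typically reports 8 */
    printf("Logical processors: %d\n", omp_get_num_procs());
    printf("Default max threads: %d\n", omp_get_max_threads());
}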

Threads run independently

- There is only one thread until the parallel directive is
  encountered.

- Seven other threads are launched at that point.

- Thread 0 is usually the master thread (the one that spawns
  the other threads).

- The parallel region is enclosed in curly brackets.

- There is an implied barrier at the end of the parallel region.

What is a barrier?

- A barrier is a point in the program that all threads must
  reach before any further processing occurs.

- Without barriers threads may run ahead of each other, so it is
  often necessary to place barriers at several points in a program.

- Barriers are sometimes implicit (as here); barriers can
  sometimes be removed (more later).

- Barriers are expensive in terms of run-time performance. A
  typical barrier may take hundreds of clock cycles to ensure
  that all threads have reached it.

- It is better to remove unnecessary barriers, but doing so is
  fraught with danger. (An explicit barrier is sketched below.)
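A minimal sketch of an explicit barrier; the printf messages are
illustrative only:

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        printf("Thread %d: before the barrier\n", id);

        /* no thread passes this point until every thread has reached it */
        #pragma omp barrier

        printf("Thread %d: after the barrier\n", id);
    }   /* implied barrier at the end of the parallel region */
}

Every "before" line is printed before any "after" line, in some
interleaved order.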

A variation of our code

#include <stdio.h>
#include <unistd.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        if (omp_get_thread_num() == 3) sleep(1);

        printf("The parallel region is executed by thread %d\n",
               omp_get_thread_num());
    }
}

Output

The parallel region is executed by thread 4

The parallel region is executed by thread 7

The parallel region is executed by thread 1

The parallel region is executed by thread 2

The parallel region is executed by thread 5

The parallel region is executed by thread 6

The parallel region is executed by thread 0

The parallel region is executed by thread 3

- Thread 3 is now suspended for 1 second, so all other threads
  complete before thread 3.


Outline

Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks


OpenMP* Overview:

A sampling of OpenMP directives, runtime calls, and environment
settings in C and Fortran:

    omp_set_lock(lck)
    #pragma omp parallel for private(A, B)
    #pragma omp critical
    C$OMP parallel do shared(a, b, c)
    C$OMP PARALLEL REDUCTION (+: A, B)
    call OMP_INIT_LOCK (ilok)
    call omp_test_lock(jlok)
    setenv OMP_SCHEDULE "dynamic"
    CALL OMP_SET_NUM_THREADS(10)
    C$OMP DO lastprivate(XX)
    C$OMP ORDERED
    C$OMP SINGLE PRIVATE(X)
    C$OMP SECTIONS
    C$OMP MASTER
    C$OMP ATOMIC
    C$OMP FLUSH
    C$OMP PARALLEL DO ORDERED PRIVATE (A, B, C)
    C$OMP THREADPRIVATE(/ABC/)
    C$OMP PARALLEL COPYIN(/blk/)
    Nthrds = OMP_GET_NUM_PROCS()
    !$OMP BARRIER

OpenMP: An API for Writing Multithreaded Applications

- A set of compiler directives and library routines for
  parallel application programmers.
- Greatly simplifies writing multi-threaded (MT) programs
  in Fortran, C and C++.
- Standardizes the last 20 years of SMP practice.

* The name "OpenMP" is the property of the OpenMP Architecture Review Board.


OpenMP Basic Defs: Solution Stack

User layer:     End User
                Application

Prog. layer:    Directives/Compiler | OpenMP library | Environment variables

System layer:   OpenMP Runtime library
                OS/system support for shared memory and threading

HW layer:       Proc1, Proc2, Proc3, ..., ProcN
                Shared Address Space

OpenMP core syntax

- Most of the constructs in OpenMP are compiler directives:

      #pragma omp construct [clause [clause]...]

  Example:

      #pragma omp parallel num_threads(4)

- Function prototypes and types are in the file:

      #include <omp.h>

- Most OpenMP* constructs apply to a "structured block".

  Structured block: a block of one or more statements with one
  point of entry at the top and one point of exit at the bottom.
  It's OK to have an exit() within the structured block (see the
  sketch below).
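A minimal sketch of what does and does not count as a structured
block (the exit() call is the one permitted escape):

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {   /* structured block: one entry at the top, one exit at the bottom */
        int id = omp_get_thread_num();
        if (id < 0) exit(1);   /* exit() is allowed inside the block */
        printf("thread %d\n", id);
        /* a return or goto jumping out of this block would be illegal */
    }
}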


Exercise 1, Part A: Hello world

Verify that your environment works.
Write a program that prints "hello world".

#include <stdio.h>

int main()
{
    int ID = 0;

    printf(" hello(%d) ", ID);
    printf(" world(%d) \n", ID);
}


Exercise 1, Part B: Hello world

Verify that your OpenMP environment works.
Write a multithreaded program that prints "hello world".
Starting from the serial program of Part A, add the OpenMP
include file and enclose the body in a parallel region:

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        int ID = 0;

        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }
}

Switches for compiling and linking:

    gcc      -fopenmp
    pgi      -mp
    intel    /Qopenmp


Exercise 1: Solution
A multi-threaded "Hello world" program

Write a multithreaded program where each thread prints "hello world".

#include <omp.h>                       // OpenMP include file
#include <stdio.h>

int main()
{
    #pragma omp parallel               // parallel region with default
    {                                  // number of threads
        int ID = omp_get_thread_num(); // runtime library function to
                                       // return a thread ID
        printf(" hello(%d) ", ID);
        printf(" world(%d) \n", ID);
    }                                  // end of the parallel region
}

Sample Output:

hello(1) hello(0) world(1)
world(0)
hello(3) hello(2) world(3)
world(2)


OpenMP Overview: How do threads interact?

- OpenMP is a multi-threading, shared address model.
  Threads communicate by sharing variables.

- Unintended sharing of data causes race conditions.
  Race condition: when the program's outcome changes as the
  threads are scheduled differently.

- To control race conditions, use synchronization to protect
  data conflicts.

- Synchronization is expensive, so change how data is accessed
  to minimize the need for synchronization. (A race and its fix
  are sketched below.)
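A minimal sketch of a race condition and its repair with the critical
construct (the shared counter is illustrative only):

#include <stdio.h>
#include <omp.h>

int main()
{
    int count = 0;

    #pragma omp parallel
    {
        /* count++; here would be a race: an unsynchronized
           read-modify-write of the shared variable count */

        #pragma omp critical
        count++;               /* one thread at a time: no race */
    }
    printf("count = %d\n", count);  /* now always equals the team size */
}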


Outline

Introduction to OpenMP
Creating Threads
Synchronization
Parallel Loops
Synchronize single masters and stuff
Data environment
Schedule your for and sections
Memory model
OpenMP 3.0 and Tasks


OpenMP Programming Model:
Fork-Join Parallelism:

Master thread spawns a team of threads as needed.

Parallelism added incrementally until performance goals
are met: i.e. the sequential program evolves into a
parallel program.

[Figure: fork-join execution. The master thread (in red) runs the
sequential parts; at each parallel region it forks a team of threads,
which may themselves open nested parallel regions, and the team joins
at the end of the region. A nested-region sketch follows.]
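A minimal sketch of a nested parallel region, assuming nested
parallelism is enabled with omp_set_nested (the team sizes of 2 are
illustrative only):

#include <stdio.h>
#include <omp.h>

int main()
{
    omp_set_nested(1);                  /* allow nested parallel regions */

    #pragma omp parallel num_threads(2)
    {
        int outer = omp_get_thread_num();

        #pragma omp parallel num_threads(2)
        {
            /* each outer thread forks its own inner team of 2 */
            printf("outer %d, inner %d\n", outer, omp_get_thread_num());
        }
    }
}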


Thread Creation: Parallel Regions

You create threads in OpenMP* with the parallel construct.
For example, to create a 4-thread parallel region:

double A[1000];
omp_set_num_threads(4);             // runtime function to request a
#pragma omp parallel                // certain number of threads
{
    int ID = omp_get_thread_num();  // runtime function returning a thread ID
    pooh(ID, A);
}

Each thread executes a copy of the code within the structured block.
Each thread calls pooh(ID,A) for ID = 0 to 3.

* The name "OpenMP" is the property of the OpenMP Architecture Review Board


Thread Creation: Parallel Regions

You create threads in OpenMP* with the parallel construct.
For example, to create a 4-thread parallel region:

double A[1000];

#pragma omp parallel num_threads(4) // clause to request a certain
{                                   // number of threads
    int ID = omp_get_thread_num();  // runtime function returning a thread ID
    pooh(ID, A);
}

Each thread executes a copy of the code within the structured block.
Each thread calls pooh(ID,A) for ID = 0 to 3.

* The name "OpenMP" is the property of the OpenMP Architecture Review Board


Thread Creation: Parallel Regions example

Each thread executes the same code redundantly.

double A[1000];                    // a single copy of A is shared
omp_set_num_threads(4);            // between all threads
#pragma omp parallel
{
    int ID = omp_get_thread_num();
    pooh(ID, A);
}
printf("all done\n");              // threads wait here for all threads
                                   // to finish before proceeding
                                   // (i.e. a barrier; discussed later)

[Figure: execution flow. After omp_set_num_threads(4), the four
threads run pooh(0,A), pooh(1,A), pooh(2,A) and pooh(3,A)
concurrently, then join before printf("all done").]

* The name "OpenMP" is the property of the OpenMP Architecture Review Board



SPMD vs. worksharing

- A parallel construct by itself creates an SPMD or "Single
  Program Multiple Data" program, i.e., each thread redundantly
  executes the same code.

- How do you split up pathways through the code between threads
  within a team? This is called worksharing:
  - Loop construct
  - Sections/section constructs
  - Single construct
  - Task construct .... coming in OpenMP 3.0
  (The sections construct is sketched below.)
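A minimal sketch of the sections construct; phase1 and phase2 are
hypothetical functions standing in for independent pieces of work:

#include <stdio.h>
#include <omp.h>

void phase1(void) { printf("phase1 on thread %d\n", omp_get_thread_num()); }
void phase2(void) { printf("phase2 on thread %d\n", omp_get_thread_num()); }

int main()
{
    #pragma omp parallel
    {
        #pragma omp sections
        {
            #pragma omp section
            phase1();          /* one thread takes this section */

            #pragma omp section
            phase2();          /* another thread takes this one */
        }  /* implied barrier at the end of the sections construct */
    }
}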


The loop worksharing constructs

The loop worksharing construct splits up loop iterations among
the threads in a team:

#pragma omp parallel
{
    #pragma omp for
    for (I = 0; I < N; I++) {
        NEAT_STUFF(I);
    }
}

The loop variable I is made private to each thread by default.
(A combined parallel for version is sketched below.)
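The parallel and for directives are often fused into a single
combined construct. A minimal, self-contained sketch (the arrays and
N are illustrative only; the loop variable is private to each thread
by default):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main()
{
    double a[N], b[N], c[N];
    int i;

    for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* one directive creates the team and splits the iterations */
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f\n", c[N-1]);
}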
The copyprivate clause

#include <stdio.h>
void input_parameters (int *, int *); // fetch values of input parameters
void do_work(int, int);

int main()
{
    int Nsize, choice;

    #pragma omp parallel private (Nsize, choice)
    {
        #pragma omp single copyprivate (Nsize, choice)
        input_parameters (&Nsize, &choice);

        do_work(Nsize, choice);
    }
}


Used with a single region to broadcast values of privates
from one member of a team to the rest of the team.


Synchronization: Barrier

Barrier: each thread waits until all threads arrive.

#pragma omp parallel shared (A, B, C) private(id)
{
    id = omp_get_thread_num();
    A[id] = big_calc1(id);

#pragma omp barrier        // explicit barrier: every A[id] is ready

#pragma omp for
    for (i = 0; i < N; i++) { C[i] = big_calc3(i, A); }
}                          // implicit barrier at the end of the
                           // parallel region

Example: postorder tree traversal with tasks (OpenMP 3.0)

void postorder(node *p)
{
    if (p->left)
#pragma omp task
        postorder(p->left);

    if (p->right)
#pragma omp task
        postorder(p->right);

#pragma omp taskwait       // wait for descendants

    process(p->data);
}

- The parent task is suspended until its children tasks complete.
- The taskwait is a task scheduling point.

(A runnable sketch of the traversal follows.)
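A minimal runnable version of the traversal, assuming a hypothetical
node type and process function; the three-node tree is illustrative
only:

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

typedef struct node { int data; struct node *left, *right; } node;

void process(int d)
{
    printf("visited %d on thread %d\n", d, omp_get_thread_num());
}

void postorder(node *p)
{
    if (p->left) {
        #pragma omp task
        postorder(p->left);
    }
    if (p->right) {
        #pragma omp task
        postorder(p->right);
    }
    #pragma omp taskwait        /* wait for the child tasks */
    process(p->data);
}

int main()
{
    node leaves[2] = { {1, NULL, NULL}, {2, NULL, NULL} };
    node root = {0, &leaves[0], &leaves[1]};

    #pragma omp parallel
    {
        #pragma omp single      /* one thread creates the root task */
        postorder(&root);
    }
}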

Task switching (OpenMP 3.0)

- Certain constructs have task scheduling points at defined
  locations within them.

- When a thread encounters a task scheduling point, it is
  allowed to suspend the current task and execute another
  (called task switching).

- It can then return to the original task and resume.

Task switching example

#pragma omp single
{
    for (i = 0; i < ONEZILLION; i++)
#pragma omp task
        process(item[i]);
}
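A runnable sketch of this single-producer pattern; NITEMS and process
are hypothetical stand-ins for the slide's item stream, and
firstprivate(i) is needed because i, declared outside the parallel
region, would otherwise be shared inside each task:

#include <stdio.h>
#include <omp.h>

#define NITEMS 100

void process(int item)
{
    printf("item %d done by thread %d\n", item, omp_get_thread_num());
}

int main()
{
    int i;

    #pragma omp parallel
    {
        #pragma omp single       /* one thread generates all the tasks */
        {
            for (i = 0; i < NITEMS; i++) {
                #pragma omp task firstprivate(i)
                process(i);
            }
        }   /* the other threads execute tasks while the producer loops */
    }
}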