Intro to Parallel Computing
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Copyright By PowCoder代写 加微信 powcoder
Outline (Asynchronous)
Previously:
• Sections
• Scheduling Loops (static, dynamic, guided, auto)
• Ordered Iterations
• Some examples
• Matrix multiplication
• Max reduction
• Some More Examples
• Asides (Producer/Consumer) and Comments on
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Matrix Multiplication
Write a C program that uses OpenMP to perform parallel matrix multiplication. The code should do the following steps:
1. create three matrices A, B, and C and initializes A and B to some values of your choice
(e.g., a[i][j]=i+j and b[i][j]=i*j).
2. perform matrix multiplication.
3. print out the resultant matrix C.
if we have two matrices:
Then their matrix product is
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Serial Version
int i, j, k; double a[NRA][NCA], b[NCA][NCB], double c[NRA][NCB];
// Initialize matrices
for (i = 0; i < NRA; i++)
for (j = 0; j < NCA; j++)
a[i][j] = i + j; for (i = 0; i < NCA; i++)
for (j = 0; j < NCB; j++) b[i][j] = i * j;
for (i = 0; i < NRA; i++)
for (j = 0; j < NCB; j++)
c[i][j] = 0;
// matrix multiplication
printf("Starting matrix multiplication...\n");
for (i = 0; i < NRA; i++)
for (j = 0; j < NCB; j++)
for (k = 0; k < NCA; k++)
c[i][j] += a[i][k] * b[k][j];
// Print results
printf("Result Matrix:\n");
for (i = 0; i < NRA; i++) { for (j = 0; j < NCB; j++)
printf("%6.2f ", c[i][j]); printf("\n");
Topic 10 - OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Q1: which loops would you parallelize?
Q2: Is there any loop-carried dependencies?
Finding the
Following is a function that finds and returns the max integer in a given array
int max_serial(int *arr){ int max = arr[0];
for (i = 0; i < N; i++)
if (max < arr[i] )
max = arr[i];
return max;
Assume N is a constant = size of the array
Write an OpenMP code to parallelize the above function.
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
a) Yes b) No
Finding the V.1
Will this code produce the required output?
int max_parallel(int *arr){ int max = arr[0];
#pragma omp parallel for
for (i = 0; i < N; i++) if (max < arr[i])
max = arr[i];
return max;
Topic 10 - OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
a) Thisisaperfec b) We might have c) Outputiscorre d) There is a synt
Finding the v.2
How about this Solution?
int max_parallel(int *arr){ int max = arr[0];
#pragma omp parallel for
for (i = 0; i < N; i++)
#pragma omp critical
if (max < arr[i])
max = arr[i];
return max; }
Topic 10 - OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
a) Thisisagoodsolution
b) We might have a race condition.
c) Outputiscorrect,butthecodeismostlysequential
d) There is a syntax error. Finding the v.3
What do you think now?
int max_parallel(int *arr, int thread_count){
int shared_max = arr[0], i;
#pragma omp parallel num_threads(thread_count)
{ int local_max = arr[0]; //each thread gets a copy of local_max
#pragma omp for
for (i = 0; i < N; i++) //each thread finds it local max if (local_max < arr[i])
local_max = arr[i];
#pragma omp critical //one thread at a time, i.e. sequential if(shared_max < local_max)
shared_max = local_max;
return shared_max;
Only a “good” solution – there is a better way using reduction! This will be discussed shortly (not in the next slide though)
Topic 10 - OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
a) Thisisagoodsolution.
b) We might have a race condition.
c) Outputiscorrect,butthecodeismostlysequential d) There is a syntax error.
Finding the v.4
Will this code produce the required output?
int max_parallel(int *arr, int thread_count){
int local_max[thread_count];
#pragma omp parallel num_threads(thread_count)
{ int tid = omp_get_thread_num(); local_max[tid] = arr[0]; #pragma omp for
for (int i = 0; i < N; i++)
Assuming 2 threads only
if (local_max[tid]
return max;
Topic 10 – OpenMP Examples, Models, SIMD
a) Thisisagoodsolution.
b) We might have a race condition.
c) Outputiscorrect,butthecodeismostlysequential
• First introduced in OpenMP v3.1 • Remember:
Reduction Syntax:
reduction (
1. Provides a private copy of the variable for each thread
2. Combines the private results on exit
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Finding the v.5
Now, what do you think?
int max_parallel(int *arr){ int i, m = arr[0];
#pragma omp parallel for reduction(max:m)
for (i = 0; i < N; i++) if (m < arr[i])
m = arr[i];
Topic 10 - OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
a) Yes b) No
Finding the Min, Max and SUM
A function that returns 3 values: max, min, and sum in an array using OpenMP
void stats(int* arr, int* h, int* l, int* s) {
int i, high = arr[0], low = arr[0], sum = 0;
#pragma omp parallel for reduction(max:high) reduction(min:low) \ reduction(+:sum)
for (i = 0; i < N; i++){ if(high
*h = high; //returning three values *l = low;
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Nested Loops: collapse clause
• Collapse directive turns nested loops into ONE big loop
• The iteration space is then divided according to the schedule
#pragma omp for collapse(n)
Where n is the number of loops to be collapsed
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
for(j=0; j<4; j++)
printf("%d,%d ", i, j);
0,0 0,1 0,2 0,3
1,2 1,3 1,4
Example: collapse
#pragma omp parallel for private(j) num_threads(4) for(i=0; i<2; i++)
#pragma omp parallel for collapse(2) num_threads(4) for(i=0; i<2; i++)
for(j=0; j<4; j++) printf("T%d,i:%d,j:%d\n",omp_get_thread_num(), i, j);
0,0 0,1 0,2 0,3 1,1 1,2 1,3 1,4
T0 T1 T2 T3
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Nested Loops: collapse clause
• The specified number of loops must be indicated
• The loops must be perfectly nested; i.e., no code between
the loops which are collapsed.
• All counters in the collapsed loops are private (no need to add the private clause for them)
Note: collapse doesn’t always improve the performance – a smart compiler will automatically collapse nested loops IF needed.
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Bubble Sort
Starting from the first item, compare adjacent items and keep “bubbling” the larger one to the right. Repeat for remaining sublist.
for(k = list_length, k >= 2; k–) for(i = 0; i < k; i++)
if(a[i-1] > a[i]) swap(&a[i-1],&a[i]);
What is the complexity?
Bubble sort cannot be easily parallelized!
Why? (i.e. identify the loop carried dependency for both for- loops)
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Odd-Even Transposition Sort
• A sequence of phases (N is number of elements in list a).
• Similar to bubble sort, but easier to parallelize
• Outer-loop has loop-carried dependence
• Inner loop doesn’t have loop-carried
dependence
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
for(phase = 0, phase < N; phase++) if(phase % 2 == 0) //even phases
for(i = 1; i < N; i += 2)
if(a[i-1] > a[i]) swap(&a[i-1],&a[i]);
else //odd phases for(i = 1; i < N-1; i += 2)
if(a[i] > a[i+1]) swap(&a[i],&a[i+1]);
Odd-Even Transposition Sort
for(phase = 0, phase < N; phase++) if(phase % 2 == 0) //even phases
for(i = 1; i < N; i += 2)
if(a[i-1] > a[i]) swap(&a[i-1],&a[i]);
else //odd phases
for(i = 1; i < N-1; i += 2)
if(a[i] > a[i+1]) swap(&a[i],&a[i+1]);
Topic 10 – OpenMP Examples, Models, SIMD
No dependency between thread, thus no race condition
COSC 407: Intro to Parallel Computing
Odd-Even Sort – Attempt 1
for (phase = 0; phase < n; phase++) { if (phase % 2 == 0)
# pragma omp parallel for num_threads(thread_count) \ default(none) shared(a, n) private(i, tmp)
for (i = 1; i < n; i += 2){ if (a[i-1] > a[i]) {
tmp = a[i-1];
a[i-1] = a[i];
a[i] = tmp;
Q: Do we need any explicit barriers in this code?
join (implicit)
# pragma omp parallel for num_threads(thread_count) \ default(none) shared(a, n) private(i, tmp)
for (i = 1; i < n-1; i += 2) if (a[i] > a[i+1]) {
tmp = a[i+1];
a[i+1] = a[i];
a[i] = tmp;
join (implicit)
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Odd-Even Sort – Attempt 2
# pragma omp parallel num_threads(thread_count) \ default(none) shared(a, n) private(i, tmp, phase)
for (phase = 0; phase < n; phase++) { if (phase % 2 == 0)
# pragma omp for
for (i = 1; i < n; i += 2){
if (a[i-1] > a[i]) { tmp = a[i-1];
a[i-1] = a[i];
a[i] = tmp; }
tells OpenMP to parallelize the for loop with the existing team of threads.
Reuse thread
# pragma omp for
for (i = 1; i < n-1; i += 2)
if (a[i] > a[i+1]) { tmp = a[i+1];
a[i+1] = a[i];
a[i] = tmp; }
Reuse thread
join (implicit)
Topic 10 – OpenMP Examples, Models, SIMD
COSC 407: Intro to Parallel Computing
Comparison of Performance
• Odd-even sort with two parallel for directives and two for directives.
(Times are in seconds, size of array is 20,000)
Thread count
Two parallel for directives
Two for directives
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Producer – Consumer Model
This model has two (or more) threads/processes work together on a shared resource (a buffer).
• Producer writes data to the buffer when it is not full
• Consumer reads data from buffer when it is not empty
consumer producer
Circular buffers are actually linear buffers in which the producer or consumer moves to the beginning whenever it reaches the end of the buffer.
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Condition Synchronization
• Condition synchronization is a mechanism to ensure that a process is blocked if some condition is not satisfied
For a consumer process
àCondition: the buffer is not empty.
For a producer process
àCondition: the buffer is not full
Topic 10 – OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Pseudo Code
#define NUM_ITEMS 30 // number of items produced and consumed
int empty = 1, full = 0; // at beginning, buffer is empty and not full
int main(){
#pragma omp parallel
assign several threads to run produce(), and several to consume() }
PRODUCER code
void put(T item){
//add item to buffer
//set empty to 0 (buffer is not empty) //set full to 1 IF buffer is full
void produce(){
while (i < NUM_ITEMS) {
#pragma omp critical(one)
if (!full) { // if full, don’t do
// anything
put(item); // puts data into buffer i++; // incr. only if item
CONSUMER code
// read item from buffer
// set full to 0 (buffer is not full) // set empty to 1 IF buffer has not // items
// return item;
void consume() {
while (j < NUM_ITEMS) {
#pragma omp critical(two) if (!empty) {
x = get(); // read data from // buffer
j++; // incr. only if item // received
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Produce-Consumer Producer
#define BUFFER_SIZE 10 #define NUM_ITEMS 30
char buffer[BUFFER_SIZE]; int i_in = 0, i_out = 0; int num_items = 0;
int empty = 1, full = 0; int i =0, j=0;
//producer code
void put(char ch) {
buffer[i_in] = ch;
i_in = (i_in + 1) % BUFFER_SIZE; // incr. producer index
//keep track of number of items in buffer
num_items++; // # of items in buffer if(num_items == 1) empty = 0; // buffer is not empty anymore if(num_items==BUFFER_SIZE) full = 1; // buffer is now full
void produce(int tid) {
while (i < NUM_ITEMS) {
#pragma omp critical(one)
if (!full) {
char ch = 'A' + rand()%26;
printf("T%d:Produced %c\t\tnow is %s\n",tid,ch,full?"Full":"Not full"); i++; // only increment if item added
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
// Buffer size
// total number of producing and consuming
//Two buffer indexes for producer/consumer //how many items in the buffer
//buffer is empty and not full at beginning //track any producer or consumer action
// put character in circular buffer
This part should be executed atomically for all producers – ok to run in parallel with consumers. e.g. you don't want more than one producer to check !full condition and proceed (what if there is only one spot left and two threads proceed because !full is true?)
Condition synchronization: can only produce if buffer is not full – if full, skip and try again later in the loop.
Produce-Consumer
char get() {
char ch = buffer[i_out]; // read item from circular buffer
i_out = (i_out + 1) % BUFFER_SIZE; // incr. consumer index //keep track of number of items in buffer
num_items--; // decr. # of items if (num_items == BUFFER_SIZE-1) full = 0;// buffer is not full anymore
if (num_items == 0) empty = 1;
return ch;
void consume(int tid) {
while (j < NUM_ITEMS) {
#pragma omp critical(two)
if (!empty) {
// buffer is now empty
Only one consumer can get
ch = get();
printf("T%d:Consumed %c\t\tnow is %s\n",tid,ch,empty?"Empty":"Not empty"); j++;
int main(){
#pragma omp parallel firstprivate(i,j) num_threads(8) {
int tid = omp_get_thread_num(); if(tid< omp_get_num_threads()/2)
produce(tid); //1st half of threads producing else
consume(tid); //2nd half consume
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Produce-Consumer
PHQGHU.........
T0: Producing
T1: Consuming
T2: Producing
T2: Producing
T1: Consuming
T0: Producing
T1: Consuming
T0: Producing
T1: Consuming
T0: Producing
T5: Consuming
T7: Consuming
Topic 10 - OpenMP Examples, Models, SIMD
P P H Q H G Q W G U
now is Not full
now is Empty
now is Not full
now is Not full
now is Not empty
now is Not full
now is Not empty
now is Not full
now is Not empty
now is Not full
now is Not empty
now is Empty
COSC 407: Intro to Parallel Computing
Consumer-Producer Applications
Computer Science Applications:
• When data is set to a printer, a producer writes
data to the buffer and the printer’s process reads
data from buffer
• When we have several “producer” threads and
several “consumer” threads
• Producer threads might “produce” requests for data.
• Consumer threads might “consume” the request by
finding or generating the requested data
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Challenge Producer-Consumer
• The code we saw earlier is not the best implementation of the producer- consumer model.
• Threads will be busy always even if they are practically doing nothing. • Example: if queue is empty, a consumer will still be busy running the loop
and checking the if condition.
• OpenMP cannot arbitrarily start/stop threads.
• Idea: locks may be used to implement condition synchronization, but we will not discuss this here.
• Other libraries can do this.
• Java and POSIX Threads can do this
• POSIX threads can explicitly create, run, stop, and destroy threads.
OpenMP is best when you want a fixed number of threads which will fully use the CPU, and the threads are more-or-less doing the same thing.
Topic 10 - OpenMP Examples, Models, SIMD COSC 407: Intro to Parallel Computing
Message-Passing: The Process
1) Master thread creates a shared array of message queues: one for each thread.
2) Start the threads using a parallel directive. Each thread allocates space for its queue.
3) After ALL the threads are done step 2, each thread will
• Send all its messages to other threads (putting them in other threads’ queues)
Shared array
• Receive all message sent to it
Topic 10 - OpenMP Examples, Models, SIMD
for(i=0; i